The objective of this code is to predict rainfall using simulated climate variables (temperature, pressure, humidity, wind) through various modeling approaches, ranging from linear and generalized regression to advanced models like M5P regression trees. The focus is on building and comparing predictive models, validating their performance using train/test splits and cross-validation, and quantitatively evaluating predictions with metrics such as RMSE and R². This code enables the exploration of model robustness, identification of the most influential variables, and visualization of model fit through plots comparing observed and predicted values, making the analysis both educational and applicable to real climate datasets.
The following packages are required for the analysis. Install them
using: install.packages('package_name'). Note: For the M5P
model, ensure Java is installed (e.g., JDK from https://www.oracle.com/java/technologies/downloads/) and
configure rJava accordingly.
library(rJava)
library(RWeka)
library(Cubist)
library(party)
library(modeldata)
library(rpart)
library(caret)
library(klaR)
library(olsrr)
library(leaps)
library(glmulti)
library(datasets)
library(car)
library(randomForest)
library(mlbench)
library(caret)
library(readxl)
library(dplyr)
library(ggplot2)
library(lattice)
library(rpart)
library(Metrics)
library(glmnet)
library(Matrix)
library(kableExtra)
library(hydroGOF)
library(neuralnet)
system("java -version")
## [1] 0
A synthetic climate dataset is generated with 100 observations and 20 variables, including rainfall, temperature, humidity, and wind speed. A subset of key variables is selected for modeling.
set.seed(123) # For reproducibility
n <- 100 # Number of observations
data <- data.frame(
rainfall = rnorm(n, mean = 50, sd = 20), # mm
temperature = rnorm(n, mean = 27, sd = 5), # °C
mslp = rnorm(n, mean = 1013, sd = 10), # hPa
humidity = rnorm(n, mean = 70, sd = 15), # %
wind_speed = rnorm(n, mean = 5, sd = 2), # m/s
wind_gust = rnorm(n, mean = 10, sd = 4),
wind_dir = runif(n, min = 0, max = 360), # degrees
sunshine = rnorm(n, mean = 8, sd = 3), # hours
cloud_cover = rnorm(n, mean = 50, sd = 20), # %
evapotransp = rnorm(n, mean = 4, sd = 1.2), # mm
soil_moisture = rnorm(n, mean = 30, sd = 10), # %
dew_point = rnorm(n, mean = 20, sd = 3), # °C
vapor_pressure = rnorm(n, mean = 25, sd = 5), # hPa
radiation = rnorm(n, mean = 200, sd = 40), # W/m²
snowfall = rnorm(n, mean = 2, sd = 1), # mm
cyclone_freq = rpois(n, lambda = 1), # count/year
drought_index = rnorm(n, mean = 0, sd = 1), # normalized index
heatwaves = rpois(n, lambda = 3), # count/year
storm_surge = rnorm(n, mean = 0.5, sd = 0.2), # m
sea_temp = rnorm(n, mean = 28, sd = 2) # °C
)
# Select key variables for modeling
data1 <- data[, c("rainfall", "temperature", "mslp", "humidity", "wind_speed")]
head(data1)
## rainfall temperature mslp humidity wind_speed
## 1 38.79049 23.44797 1034.988 59.27137 4.852888
## 2 45.39645 28.28442 1026.124 58.70967 2.662697
## 3 81.17417 25.76654 1010.349 55.92192 3.730503
## 4 51.41017 25.26229 1018.432 54.21230 4.942317
## 5 52.58575 22.24191 1008.857 63.44261 6.341392
## 6 84.30130 26.77486 1008.238 74.96769 1.698907
str(data1)
## 'data.frame': 100 obs. of 5 variables:
## $ rainfall : num 38.8 45.4 81.2 51.4 52.6 ...
## $ temperature: num 23.4 28.3 25.8 25.3 22.2 ...
## $ mslp : num 1035 1026 1010 1018 1009 ...
## $ humidity : num 59.3 58.7 55.9 54.2 63.4 ...
## $ wind_speed : num 4.85 2.66 3.73 4.94 6.34 ...
The dataset is split into training (73.68%) and validation (26.32%) sets to evaluate model performance.
set.seed(123)
partition <- createDataPartition(y = data1$rainfall, p = 0.7368, list = FALSE)
trainingdata <- data1[partition, ]
validationdata <- data1[-partition, ]
Linear and generalized linear regression models are trained on the training data, and predictions are made on the validation set.
# Linear regression
model <- lm(rainfall ~ ., data = trainingdata)
p <- predict(model, validationdata)
# Generalized linear regression
modelg <- glm(rainfall ~ ., data = trainingdata)
pg <- predict(modelg, validationdata)
A 10-fold cross-validation is performed on the linear regression model to assess its robustness.
modelcv <- train(rainfall ~ ., data = trainingdata,
method = "lm",
trControl = trainControl(method = "cv", number = 10))
pcv <- predict(modelcv, validationdata)
An M5P regression tree model is trained on the full dataset, and its performance is evaluated.
fit <- M5P(rainfall ~ ., data = data1)
predictions <- predict(fit, data1)
summary(fit)
##
## === Summary ===
##
## Correlation coefficient 0
## Mean absolute error 16.6471
## Root mean squared error 20.3919
## Relative absolute error 114.1338 %
## Root relative squared error 112.2606 %
## Total Number of Instances 100
# Evaluation
obs <- data1$rainfall
gof(predictions, obs)
## [,1]
## ME 9.27
## MAE 16.65
## MSE 415.83
## RMSE 20.39
## NRMSE % 111.70
## PBIAS % 17.90
## RSR 1.12
## rSD 0.00
## NSE -0.26
## mNSE -0.14
## rNSE -22.14
## d 0.39
## md 0.30
## rd -10.13
## cp 0.39
## r NA
## R2 NA
## bR2 NA
## KGE NA
## VE 0.68
# Comparison of predictions and observations
cbind(rownames(data1), predictions, obs)
## predictions obs
## [1,] "1" "61.0747382902458" "38.7904870689557"
## [2,] "2" "61.0747382902458" "45.3964502103344"
## [3,] "3" "61.0747382902458" "81.1741662829825"
## [4,] "4" "61.0747382902458" "51.4101678284915"
## [5,] "5" "61.0747382902458" "52.5857547032189"
## [6,] "6" "61.0747382902458" "84.3012997376656"
## [7,] "7" "61.0747382902458" "59.218324119784"
## [8,] "8" "61.0747382902458" "24.6987753078693"
## [9,] "9" "61.0747382902458" "36.2629429621295"
## [10,] "10" "61.0747382902458" "41.0867605980008"
## [11,] "11" "61.0747382902458" "74.4816359487892"
## [12,] "12" "61.0747382902458" "57.1962765411473"
## [13,] "13" "61.0747382902458" "58.015429011881"
## [14,] "14" "61.0747382902458" "52.2136543189024"
## [15,] "15" "61.0747382902458" "38.8831773049185"
## [16,] "16" "61.0747382902458" "85.7382627360616"
## [17,] "17" "61.0747382902458" "59.9570095645848"
## [18,] "18" "61.0747382902458" "10.6676568674072"
## [19,] "19" "61.0747382902458" "64.0271180312737"
## [20,] "20" "61.0747382902458" "40.5441718454413"
## [21,] "21" "61.0747382902458" "28.6435258802631"
## [22,] "22" "61.0747382902458" "45.6405017068341"
## [23,] "23" "61.0747382902458" "29.4799110338552"
## [24,] "24" "61.0747382902458" "35.4221754141772"
## [25,] "25" "61.0747382902458" "37.4992146430149"
## [26,] "26" "61.0747382902458" "16.2661337851517"
## [27,] "27" "61.0747382902458" "66.7557408898905"
## [28,] "28" "61.0747382902458" "53.0674623567303"
## [29,] "29" "61.0747382902458" "27.237261259761"
## [30,] "30" "61.0747382902458" "75.0762984213985"
## [31,] "31" "61.0747382902458" "58.5292844295363"
## [32,] "32" "61.0747382902458" "44.0985703401546"
## [33,] "33" "61.0747382902458" "67.9025132209004"
## [34,] "34" "61.0747382902458" "67.5626697506608"
## [35,] "35" "61.0747382902458" "66.4316216327497"
## [36,] "36" "61.0747382902458" "63.7728050820018"
## [37,] "37" "61.0747382902458" "61.0783530707518"
## [38,] "38" "61.0747382902458" "48.7617657884656"
## [39,] "39" "61.0747382902458" "43.8807467252017"
## [40,] "40" "61.0747382902458" "42.3905799797523"
## [41,] "41" "61.0747382902458" "36.1058604215897"
## [42,] "42" "61.0747382902458" "45.841654439608"
## [43,] "43" "61.0747382902458" "24.6920729686347"
## [44,] "44" "61.0747382902458" "93.3791193067702"
## [45,] "45" "61.0747382902458" "74.1592399660998"
## [46,] "46" "61.0747382902458" "27.537828335933"
## [47,] "47" "61.0747382902458" "41.9423032940185"
## [48,] "48" "61.0747382902458" "40.6668929275356"
## [49,] "49" "61.0747382902458" "65.5993023667264"
## [50,] "50" "61.0747382902458" "48.3326186705634"
## [51,] "51" "61.0747382902458" "55.0663702798951"
## [52,] "52" "61.0747382902458" "49.4290648930259"
## [53,] "53" "61.0747382902458" "49.1425908541737"
## [54,] "54" "61.0747382902458" "77.3720456802891"
## [55,] "55" "61.0747382902458" "45.4845802868146"
## [56,] "56" "61.0747382902458" "80.3294120885908"
## [57,] "57" "61.0747382902458" "19.0249439153956"
## [58,] "58" "61.0747382902458" "61.6922749927214"
## [59,] "59" "61.0747382902458" "52.4770848768923"
## [60,] "60" "61.0747382902458" "54.3188313748795"
## [61,] "61" "61.0747382902458" "57.5927896551976"
## [62,] "62" "61.0747382902458" "39.953530937814"
## [63,] "63" "61.0747382902458" "43.3358523266116"
## [64,] "64" "61.0747382902458" "29.6284923378582"
## [65,] "65" "61.0747382902458" "28.5641754704884"
## [66,] "66" "61.0747382902458" "56.0705728280852"
## [67,] "67" "61.0747382902458" "58.9641955725885"
## [68,] "68" "61.0747382902458" "51.0600845346101"
## [69,] "69" "61.0747382902458" "68.4453493575948"
## [70,] "70" "61.0747382902458" "91.0016937125429"
## [71,] "71" "61.0747382902458" "40.1793766788693"
## [72,] "72" "61.0747382902458" "3.81662248718375"
## [73,] "73" "61.0747382902458" "70.1147704892451"
## [74,] "74" "61.0747382902458" "35.8159847483522"
## [75,] "75" "61.0747382902458" "36.2398276706528"
## [76,] "76" "61.0747382902458" "70.511427393934"
## [77,] "77" "61.0747382902458" "44.3045398589798"
## [78,] "78" "61.0747382902458" "25.5856457549093"
## [79,] "79" "61.0747382902458" "53.626069594983"
## [80,] "80" "61.0747382902458" "47.2221727512191"
## [81,] "81" "61.0747382902458" "50.1152837179977"
## [82,] "82" "61.0747382902458" "57.7056080225266"
## [83,] "83" "61.0747382902458" "42.5867993641518"
## [84,] "84" "61.0747382902458" "62.8875309703767"
## [85,] "85" "61.0747382902458" "45.590268763625"
## [86,] "86" "61.0747382902458" "56.6356392783139"
## [87,] "87" "61.0747382902458" "71.936780262987"
## [88,] "88" "61.0747382902458" "58.7036298166761"
## [89,] "89" "61.0747382902458" "43.4813682893755"
## [90,] "90" "61.0747382902458" "72.9761523690219"
## [91,] "91" "61.0747382902458" "69.8700771192424"
## [92,] "92" "61.0747382902458" "60.9679391901614"
## [93,] "93" "61.0747382902458" "54.7746347022288"
## [94,] "94" "61.0747382902458" "37.4418784792126"
## [95,] "95" "61.0747382902458" "77.2130489706002"
## [96,] "96" "61.0747382902458" "37.9948082570575"
## [97,] "97" "61.0747382902458" "93.7466598603315"
## [98,] "98" "61.0747382902458" "80.6522125237038"
## [99,] "99" "61.0747382902458" "45.2859928179905"
## [100,] "100" "61.0747382902458" "29.4715819938644"
Visualizations are created to compare observed and predicted rainfall values and to assess model residuals.
# Observed vs Predicted (M5P)
df_pred <- data.frame(obs = obs, pred = predictions)
ggplot(df_pred, aes(x = obs, y = pred)) +
geom_point(color = "blue", alpha = 0.6, size = 2) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", size = 1) +
labs(title = "Observed vs Predicted Rainfall (M5P)",
x = "Observed Rainfall (mm)",
y = "Predicted Rainfall (mm)") +
theme_minimal(base_size = 14)
# Residual plots for linear model
par(mfrow = c(1,2))
plot(model, which = 1) # Residuals vs Fitted
plot(model, which = 2) # QQ-plot of residuals
This project serves as an educational exercise, demonstrating the application of various modeling techniques, from linear regression to M5P regression trees, in a controlled environment. The methodology, including train/test splits, cross-validation, and performance evaluation, can be applied to real-world climate data analysis, providing insights into predictive modeling and variable importance.author: “Abdi-Basid ADAN” email: “abdi-basid@outlook.com”