Regression with a flood prediction dataset
A Kaggle competition - Playground Series - Season 4, Episode 5
Introduction
The purpose of this challenge is to find a model that predicts the probability of flooding in a certain region, given certain features of the landscape.
The data set is based on the Flood Prediction Factor data set, and the data can be accessed through Kaggle. In this article we shall assume the data are saved in a sub-folder of the project directory named ./data.
Loading the libraries and the data set
We shall use the following libraries:
library(tidyverse)
library(tidymodels)
We shall also need a couple of convenience functions to reshape the data. Specifically, since we wish to predict probabilities, it is convenient to apply the logit function to convert the original probabilities into real numbers, and then to use the inverse logit function on the predicted values to convert them back into probabilities. The logit function is defined as follows:
x = \log\left(\dfrac{p}{1 - p}\right)
where x is a real number and p is a probability, i.e. a real number such that 0 < p < 1. We can also include the extreme values p = 0 and p = 1 if we allow x to take the infinite values \pm\infty. The inverse of this function is
p = \dfrac{e^x}{1 + e^x}
Base R does not ship a function called logit (although stats::qlogis() and stats::plogis() compute the same quantities), so let's define our own pair to keep the code neat:
# Logit: map a probability p in (0, 1) to a real number
logit <- function(p) {
  log(p / (1 - p))
}

# Inverse logit: map a real number x back to a probability in (0, 1)
invlogit <- function(x) {
  exp(x) / (1 + exp(x))
}
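As a quick check (the values below are chosen only for illustration), logit() maps 0.5 to 0 and invlogit() undoes logit():
logit(0.5)            # 0
invlogit(logit(0.3))  # 0.3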
Loading the data
The data set can be downloaded by subscribing to the competition "Regression with a Flood Prediction Dataset" and consists of three files: train.csv, test.csv and sample_submission.csv. Let us only load the train.csv file for the moment, using the janitor package to make the variable names more readable.
floods <- read_csv("./data/train.csv") |>
  janitor::clean_names()
Exploratory Data Analysis
Let’s start by looking at the variables in the data set:
glimpse(floods)
Rows: 1,117,957
Columns: 22
$ id <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
$ monsoon_intensity <dbl> 5, 6, 6, 3, 5, 5, 8, 6, 5, 4, 3, 7, …
$ topography_drainage <dbl> 8, 7, 5, 4, 3, 4, 3, 6, 2, 2, 7, 4, …
$ river_management <dbl> 5, 4, 6, 6, 2, 1, 1, 5, 8, 3, 2, 5, …
$ deforestation <dbl> 8, 4, 7, 5, 6, 4, 2, 7, 5, 5, 6, 4, …
$ urbanization <dbl> 6, 8, 3, 4, 4, 2, 3, 5, 4, 8, 6, 4, …
$ climate_change <dbl> 4, 8, 7, 8, 4, 4, 7, 5, 5, 6, 3, 2, …
$ dams_quality <dbl> 4, 3, 1, 4, 3, 6, 3, 3, 2, 5, 2, 4, …
$ siltation <dbl> 3, 5, 5, 7, 3, 6, 4, 5, 4, 5, 3, 6, …
$ agricultural_practices <dbl> 3, 4, 4, 6, 3, 7, 6, 5, 5, 7, 3, 6, …
$ encroachments <dbl> 4, 6, 5, 8, 3, 5, 7, 5, 5, 6, 2, 4, …
$ ineffective_disaster_preparedness <dbl> 2, 9, 6, 5, 5, 5, 5, 3, 2, 4, 6, 4, …
$ drainage_systems <dbl> 5, 7, 7, 2, 2, 3, 2, 5, 9, 6, 9, 6, …
$ coastal_vulnerability <dbl> 3, 2, 3, 4, 2, 5, 5, 3, 2, 3, 5, 6, …
$ landslides <dbl> 3, 0, 7, 7, 6, 5, 6, 5, 7, 3, 2, 2, …
$ watersheds <dbl> 5, 3, 5, 4, 6, 4, 4, 5, 3, 4, 5, 4, …
$ deteriorating_infrastructure <dbl> 4, 5, 6, 4, 4, 4, 5, 8, 4, 4, 4, 7, …
$ population_score <dbl> 7, 3, 8, 6, 1, 6, 6, 6, 6, 3, 5, 7, …
$ wetland_loss <dbl> 5, 3, 2, 5, 2, 8, 3, 8, 4, 3, 8, 8, …
$ inadequate_planning <dbl> 7, 4, 3, 7, 3, 3, 4, 5, 5, 5, 8, 3, …
$ political_factors <dbl> 3, 3, 3, 5, 5, 2, 6, 6, 5, 6, 5, 0, …
$ flood_probability <dbl> 0.445, 0.450, 0.530, 0.535, 0.415, 0…
They are all numerical variables, so let's try to get a feel for their distribution with a quick visualisation. Let us start with the (potential) predictors.
floods |>
  select(!id) |>
  pivot_longer(monsoon_intensity:political_factors) |>
  ggplot(aes(x = value, y = after_stat(density))) +
  geom_histogram(bins = 20, alpha = 0.8, fill = "midnightblue") +
  facet_wrap(vars(name), scales = "free_y")
The distributions appear bell-shaped and all variables seem to have a very similar shape. Let us also have a look at the distribution of the outcome variable.
floods |>
  ggplot(aes(x = flood_probability, y = after_stat(density))) +
  geom_histogram(bins = 50, alpha = 0.8, fill = "midnightblue")
Again quite a bell-shaped distribution, although with a spiky, tri-modal look.
Finally, let us transform the flood probability into its logit.
floods_df <- floods |>
  mutate(flood_logit = logit(flood_probability)) |>
  select(!flood_probability)
Building a model
Given the distribution and the type of variables, it seems reasonable to try a simple linear regression model. To check whether this model may be good enough, let us start by splitting the data into a training and testing pair, and let us also create some cross-validation folds to analyse the model.
Spending our data budget
set.seed(123)
floods_split <- initial_split(floods_df)
floods_train <- training(floods_split)
floods_test <- testing(floods_split)

set.seed(234)
floods_folds <- vfold_cv(floods_train)
Specifying the model
Let us now specify the model: in our case we shall use a simple linear regression model. In tidymodels we need to specify the type of model, its mode (regression or classification) and the engine we wish to use. We shall go for the simple linear model from base R (lm).
lm_spec <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")
Pre-processing our data
The next step is to decide the sequence of pre-processing operations that we wish to apply to the data before fitting the model. We definitely want to check whether any pair of variables is heavily correlated and exclude one of them, remove any predictor with near-zero variance, and then standardise every variable. This can be done in tidymodels using a recipe, as follows:
floods_rec <-
  recipe(flood_logit ~ ., data = floods_train) |>
  update_role(id, new_role = "id") |>
  step_corr(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_normalize(all_predictors())
Thus far, the recipe is only a list of actions:
floods_rec
In order to see the recipe applied to the floods_train dataset we need to “prep” the recipe:
prep(floods_rec)
And to see the actual result of this, we need to “juice” the pre-processed dataset out of the recipe. This can be done like so:
floods_rec |>
  prep() |>
  juice()
# A tibble: 838,467 × 22
id monsoon_intensity topography_drainage river_management deforestation
<dbl> <dbl> <dbl> <dbl> <dbl>
1 134057 -1.42 -0.920 0.505 1.49
2 685284 -1.42 0.0358 -0.461 2.46
3 497689 -0.448 0.514 -1.91 -1.44
4 402857 -0.934 0.514 0.988 -0.948
5 583421 -0.934 1.47 0.988 1.49
6 883280 0.524 -0.442 -1.43 0.515
7 669542 -0.934 -0.920 1.95 -1.44
8 944671 -0.448 -0.920 0.0220 1.98
9 613996 -0.934 -0.442 0.505 -0.948
10 402312 0.524 -0.442 -1.43 0.515
# ℹ 838,457 more rows
# ℹ 17 more variables: urbanization <dbl>, climate_change <dbl>,
# dams_quality <dbl>, siltation <dbl>, agricultural_practices <dbl>,
# encroachments <dbl>, ineffective_disaster_preparedness <dbl>,
# drainage_systems <dbl>, coastal_vulnerability <dbl>, landslides <dbl>,
# watersheds <dbl>, deteriorating_infrastructure <dbl>,
# population_score <dbl>, wetland_loss <dbl>, inadequate_planning <dbl>, …
This is a bit of a controversial choice, as the juice() function has been superseded in favour of bake(new_data = NULL), but I personally prefer to use juice() and to reserve bake() for when the recipe is applied to a different dataset (e.g. the test data set).
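For illustration, this is how the same prepped recipe would be applied to a different data set with bake(); here I use the held-out floods_test purely as an example:
# Apply the trained recipe to new data (illustrative only)
floods_rec |>
  prep() |>
  bake(new_data = floods_test)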
Workflow
The final step of preparing the model is to put together the model specification and the recipe into a workflow(), which will then collect all the information needed to train the model. This is quite straightforward to do (and it may look unnecessary at first glance, but it becomes more useful when a model needs tuning and successive refinements).
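The one-liner below does exactly this. Equivalently, the same workflow could be built incrementally with add_recipe() and add_model(), which can be handy when components are swapped later; a minimal sketch (the name lm_wf_alt is only for illustration):
# Equivalent, more explicit construction of the same workflow
lm_wf_alt <- workflow() |>
  add_recipe(floods_rec) |>
  add_model(lm_spec)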
lm_wf <- workflow(floods_rec, lm_spec)
Assessing the model
We can now proceed to assess the model by fitting it on the cross-validation resamples. What does that mean? Let's first have a look at the cross-validation folds:
floods_folds
# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [754620/83847]> Fold01
2 <split [754620/83847]> Fold02
3 <split [754620/83847]> Fold03
4 <split [754620/83847]> Fold04
5 <split [754620/83847]> Fold05
6 <split [754620/83847]> Fold06
7 <split [754620/83847]> Fold07
8 <split [754621/83846]> Fold08
9 <split [754621/83846]> Fold09
10 <split [754621/83846]> Fold10
As we can see, the folds are made of two columns: splits and id. Every split is made of two parts of the floods_train dataset: the fit_resamples() function will train the model on the first element of the split, the analysis part, then use that model to predict on the second element, the assessment part, and collect the metrics for the accuracy of those predictions. This is done for every split in the cross-validation folds.
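To see concretely what the two parts of a split look like, we can pull them out of the first fold with rsample's analysis() and assessment() functions (just an aside; fit_resamples() handles this for us):
# Peek at the two parts of the first fold
analysis(floods_folds$splits[[1]])    # data used to fit the model
assessment(floods_folds$splits[[1]])  # data used to evaluate it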
Let us also call the doParallel::registerDoParallel() function to speed up the process.
doParallel::registerDoParallel()

floods_res <-
  fit_resamples(
    lm_wf,
    resamples = floods_folds,
    control = control_resamples(save_pred = TRUE)
  )
We can now collect the metrics:
collect_metrics(floods_res)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 0.0809 10 0.0000744 Preprocessor1_Model1
2 rsq standard 0.846 10 0.000272 Preprocessor1_Model1
We can also make a visualisation that shows how the model performs:
collect_predictions(floods_res) |>
  mutate(.pred = invlogit(.pred),
         flood_probability = invlogit(flood_logit)) |>
  ggplot(aes(x = flood_probability, y = .pred, colour = id)) +
  geom_abline(linetype = "dashed", linewidth = 1, colour = "gray30") +
  geom_point(size = 2, alpha = 0.1) +
  coord_equal() +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))
We can see that the model is perhaps not perfect, but in the balance between simplicity and accuracy it seems a good compromise.
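Note that, since the model is fitted on the logit scale, the rmse and rsq reported above are also on that scale. If we want an indicative figure on the original probability scale, we can recompute the metrics from the back-transformed predictions with yardstick::metrics() (this pools all folds together, so take it only as a rough check; the column names prob_pred and prob_true are introduced here just for clarity):
# Metrics (rmse, rsq, mae) on the probability scale, pooled across folds
collect_predictions(floods_res) |>
  mutate(prob_pred = invlogit(.pred),
         prob_true = invlogit(flood_logit)) |>
  metrics(truth = prob_true, estimate = prob_pred)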
Last fit
We can now proceed to the final fit, using the whole training set to train the model, and then applying it to the testing set to check how the model performs on data that were not involved in the training.
fitted_model <- last_fit(lm_wf, floods_split)
Again we can collect the metrics and make another visualisation.
collect_metrics(fitted_model)
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.0809 Preprocessor1_Model1
2 rsq standard 0.847 Preprocessor1_Model1
collect_predictions(fitted_model) |>
  ggplot(aes(x = invlogit(flood_logit), y = invlogit(.pred))) +
  geom_abline(colour = "gray50", linewidth = 1, linetype = "dashed") +
  geom_point(size = 2, alpha = 0.4, colour = "midnightblue") +
  coord_equal() +
  labs(x = "True probability value", y = "Predicted probability value")
Applying the model to a new data set
Once we have our fitted model, we can extract the workflow and use it for other predictions. In particular, we shall use it to make predictions on the test.csv data set given in the competition. Let's first extract the workflow:
model <- extract_workflow(fitted_model)
Now we can load the new data
floods_new <- read_csv("./data/test.csv") |>
  janitor::clean_names()
and apply the model to this new set. The competition requires only two columns, one with the id and one with the probability of flood, FloodProbability. We can therefore save that into a variable called submission.
submission <-
  augment(model, new_data = floods_new) |>
  mutate(flood_probability = invlogit(.pred)) |>
  select(id, FloodProbability = flood_probability)
The final thing to do to complete the competition is to save the submission variable into a csv file and submit it to Kaggle to have it checked.
write_csv(submission, "./new-submission.csv")
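Before uploading, it may be worth a quick sanity check that the submission has one row per test id and that the predicted probabilities lie between 0 and 1 (the column names n, min_p and max_p below are just illustrative):
# Quick sanity check on the submission
submission |>
  summarise(n = n(),
            min_p = min(FloodProbability),
            max_p = max(FloodProbability))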