Harmful algae blooms represent a significant threat to human health and aquatic ecosystem because they can produce some dangerous toxins. The phenomenon is the result of a eutrophication of the water bodies. Three conditions favour the algae bloom: enough light (Photosynthetically active radiation), slow water movement (ponds and estuary environments are more likely to have algae blooms) and excess of nutrients (nitrate and phosphate).
The dataset is from the 1999 Computational Intelligence and Learning (COIL) competition. Available on https://archive.ics.uci.edu/ml/datasets/Coil+1999+Competition+Data and included in the package DMwR from Data Mining with R: Learning with Case Studies by Luis Torgo which is a very interesting book. Highly recommended!
“The data contains measurements of river chemical concentrations and algae densities. There are a total of 340 examples each containing 17 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 8 values of each example are the distribution of different kinds of algae. These 8 kinds are only a very small part of the whole community, but for the competition we limited the number to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields which are labeled with the string XXXXX.” - UCI
The first objective of this rpubs is to practice the mlr package to predict the “a1” algae density.
library(DMwR)
library(mlr)
library(dplyr)
library(gridExtra)
library(corrplot)
data(algae)
df <- algae # Training set
df1 <- cbind(test.algae,algae.sols) #Testing set
Create function plotBox and doPlots
plotBox <- function(data.in, op1, i) {
varname <- colnames(data.in)[i]
data <- data.in %>% select_(lazyeval::interp(~op1, op1 = as.name(op1)),
lazyeval::interp(~varname, varname = as.name(varname)))
colnames(data)[1] <- "cat"
colnames(data)[2] <- "value"
p<- ggplot(data, aes (x = factor(cat), y = value, fill = cat)) + geom_boxplot() +
ylab(varname) + theme_bw() + theme(axis.title.x = element_blank()) + theme(legend.position = "none")
return (p)
}
doPlots <- function(ii, op, ncol=3) {
pp <- list()
for (i in ii) {
p <- plotBox(data.in = df, i=i, op1 = op)
pp <- c(pp, list(p))
}
do.call("grid.arrange", c(pp, ncol=ncol))
}
Look at the correlation between variables with corrplot
There are 16 uncompleted observations on 200 observations. It represents 8%. I can train our models on 184 observations. However, in this rpubs, we try to fill all the NA values
After analyzing the boxplots and correlations between variables. I decided to impute the NA values of pH, oPO4, PO4 and Chlorophyll a(Chla) based on size, O2 on season and other variables with no distinction. The NA values are replaced by its corresponding average values.
#Imputation by 3S
averg <- function(x) ifelse(is.na(x),mean(x,na.rm = T),x)
imputation <- function(df,opt1,n1,n2) {
for(i in n1:n2){
col = i
varname <- colnames(df)[col]
expr <- lazyeval::interp(~averg(x), x = as.name(varname))
df <- df %>% group_by_(lazyeval::interp(~opt1, opt1 = as.name(opt1))) %>%
mutate_(.dots=setNames(list(expr),varname)) %>% ungroup()
}
return(df)
}
#Imputation for train data
df_imp <- imputation(df,opt1 = "size",4,4) #imputation for pH
df_imp <- imputation(df_imp,opt1 = "season",5,5) #imputation for O2
df_imp <- imputation(df_imp,opt1 = "size",9,11) #imputation for oPO4,PO4, Chla
df_imp <- df_imp %>% mutate(Cl = ifelse(is.na(Cl),mean(Cl,na.rm = T),Cl),
NO3 = ifelse(is.na(NO3),mean(NO3,na.rm = T),NO3),
NH4 = ifelse(is.na(NH4),mean(NH4,na.rm = T),NH4))
I use mlr package to benchmark 4 models: GLM, knn, SVM and RF. Let’s see which model performs well. I remind that I only predict the “a1” algae here.
#Keep only the a1 algae
df_imp <- df_imp[,1:12]
df1_imp <- df1_imp[,1:12]
#Create task
train.task = makeRegrTask(id = "TrainAlgae", data = df_imp, target = "a1")
test.task = makeRegrTask(id = "TestAlgae", data = df1_imp, target = "a1")
#Make learner
learners <- list(makeLearner("regr.glm", id = "glm") ,
makeLearner("regr.kknn", id = "knn") ,
makeLearner("regr.ksvm", id = "svm") ,
makeLearner("regr.ranger", id = "RF"))
#Cross validation with 20 iterations
resmp= makeResampleDesc("CV", iters = 20)
set.seed(123)
bench <- benchmark(learners,train.task,resmp)
getBMRAggrPerformances(bench)
## $TrainAlgae
## $TrainAlgae$glm
## mse.test.mean
## 373.8525
##
## $TrainAlgae$knn
## mse.test.mean
## 352.8755
##
## $TrainAlgae$svm
## mse.test.mean
## 294.5525
##
## $TrainAlgae$RF
## mse.test.mean
## 246.9343
#bmrkdata <- getBMRPerformances(bench, as.df = TRUE)
plotBMRBoxplots(bench, measure = mse)+aes(fill=learner.id)+coord_flip()+theme_bw()
The RF performs better than other models
Now, i decide to choose RF as final model. But at first, i need to know the contribution of each variable to predict the a1 algae. After that, i retain only 5 most important variables and see whether the model is improved
fv = generateFilterValuesData(train.task, method = c("cforest.importance"))
plotFilterValues(fv)
## Keep the 5 most important features
filtered.task = filterFeatures(train.task, method = "cforest.importance", abs = 5)
#Fuse
lrn <- makeFilterWrapper(learner = "regr.ranger", fw.method = "cforest.importance", fw.abs = 5)
rdesc = makeResampleDesc("CV", iters = 20)
r = resample(learner = lrn, task = train.task, resampling = rdesc, show.info = FALSE, models = TRUE)
r$aggr
## mse.test.mean
## 248.3011
The model is not improved. However, the difference of MSE is weak. I decided to retain only these 5 variables into the final model because we can optimize the time computing.
df2 <- select(df_imp,PO4,oPO4,Cl,size,mxPH,a1)
df3 <- select(df1_imp,PO4,oPO4,Cl,size,mxPH,a1)
learner_rf <- makeLearner("regr.ranger", id = "RF")
train.task = makeRegrTask(id = "TrainAlgae", data = df2, target = "a1")
test.task = makeRegrTask(id = "TestAlgae", data = df3, target = "a1")
set.seed(1111)
rf <- train(learner_rf,train.task)
rf_pred <- predict(rf, task = test.task)
res <- rf_pred$data
cor(res$truth,res$response)^2
## [1] 0.5344292
The RF model can explain 53 % of variance. It’s not a good prediction however i think we cannot expect much better because biological proccesses are very complex. More parameters and measurements might be necessary to have a good prediction