Some words about the context

Harmful algae blooms represent a significant threat to human health and aquatic ecosystem because they can produce some dangerous toxins. The phenomenon is the result of a eutrophication of the water bodies. Three conditions favour the algae bloom: enough light (Photosynthetically active radiation), slow water movement (ponds and estuary environments are more likely to have algae blooms) and excess of nutrients (nitrate and phosphate).

The dataset is from the 1999 Computational Intelligence and Learning (COIL) competition. Available on https://archive.ics.uci.edu/ml/datasets/Coil+1999+Competition+Data and included in the package DMwR from Data Mining with R: Learning with Case Studies by Luis Torgo which is a very interesting book. Highly recommended!

Dataset

“The data contains measurements of river chemical concentrations and algae densities. There are a total of 340 examples each containing 17 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 8 values of each example are the distribution of different kinds of algae. These 8 kinds are only a very small part of the whole community, but for the competition we limited the number to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields which are labeled with the string XXXXX.” - UCI

Objective

The first objective of this rpubs is to practice the mlr package to predict the “a1” algae density.

0. Load data

library(DMwR)
library(mlr)
library(dplyr)
library(gridExtra)
library(corrplot)

data(algae)
df <- algae # Training set
df1 <- cbind(test.algae,algae.sols) #Testing set

1. Boxplot each variable ~ season, size and speed (3S)

Create function plotBox and doPlots

plotBox <- function(data.in, op1, i) {
  varname <- colnames(data.in)[i]
  data <- data.in %>% select_(lazyeval::interp(~op1, op1 = as.name(op1)),
                              lazyeval::interp(~varname, varname = as.name(varname)))
 
  colnames(data)[1] <- "cat"
  colnames(data)[2] <- "value"
  p<- ggplot(data, aes (x = factor(cat), y = value, fill = cat)) + geom_boxplot() +
    ylab(varname) + theme_bw() + theme(axis.title.x = element_blank()) + theme(legend.position = "none")
  return (p)
}

doPlots <- function(ii, op, ncol=3) {
  pp <- list()
  for (i in ii) {
    p <- plotBox(data.in = df, i=i, op1 = op)
    pp <- c(pp, list(p))
  }
  do.call("grid.arrange", c(pp, ncol=ncol))
}

Boxplot by season

Boxplot by speed

Boxplot by size

2. Correlation ~ season, size and speed(3S)

Look at the correlation between variables with corrplot

3. Imputation NA values

There are 16 uncompleted observations on 200 observations. It represents 8%. I can train our models on 184 observations. However, in this rpubs, we try to fill all the NA values

After analyzing the boxplots and correlations between variables. I decided to impute the NA values of pH, oPO4, PO4 and Chlorophyll a(Chla) based on size, O2 on season and other variables with no distinction. The NA values are replaced by its corresponding average values.

Function imputation

#Imputation by 3S
averg <- function(x) ifelse(is.na(x),mean(x,na.rm = T),x)
imputation <- function(df,opt1,n1,n2) {
for(i in n1:n2){
  col = i
varname <- colnames(df)[col]  
expr <- lazyeval::interp(~averg(x), x = as.name(varname))

df <- df %>% group_by_(lazyeval::interp(~opt1, opt1 = as.name(opt1))) %>% 
  mutate_(.dots=setNames(list(expr),varname)) %>% ungroup()
}
return(df)
}

Imputation for train data

#Imputation for train data

df_imp <- imputation(df,opt1 = "size",4,4) #imputation for pH
df_imp <- imputation(df_imp,opt1 = "season",5,5) #imputation for O2
df_imp <- imputation(df_imp,opt1 = "size",9,11) #imputation for oPO4,PO4, Chla
df_imp <- df_imp %>%  mutate(Cl = ifelse(is.na(Cl),mean(Cl,na.rm = T),Cl),
                             NO3 = ifelse(is.na(NO3),mean(NO3,na.rm = T),NO3),
                             NH4 = ifelse(is.na(NH4),mean(NH4,na.rm = T),NH4))

4. Benchmark models

I use mlr package to benchmark 4 models: GLM, knn, SVM and RF. Let’s see which model performs well. I remind that I only predict the “a1” algae here.

#Keep only the a1 algae
df_imp <- df_imp[,1:12]
df1_imp <- df1_imp[,1:12]

#Create task
train.task = makeRegrTask(id = "TrainAlgae", data = df_imp, target = "a1")
test.task = makeRegrTask(id = "TestAlgae", data = df1_imp, target = "a1")

#Make learner
learners <- list(makeLearner("regr.glm", id = "glm") ,
                 makeLearner("regr.kknn", id = "knn") ,
                 makeLearner("regr.ksvm", id = "svm") ,
                 makeLearner("regr.ranger", id = "RF"))

#Cross validation with 20 iterations
resmp= makeResampleDesc("CV", iters = 20)

set.seed(123) 
bench <- benchmark(learners,train.task,resmp)

getBMRAggrPerformances(bench)

## $TrainAlgae
## $TrainAlgae$glm
## mse.test.mean 
##      373.8525 
## 
## $TrainAlgae$knn
## mse.test.mean 
##      352.8755 
## 
## $TrainAlgae$svm
## mse.test.mean 
##      294.5525 
## 
## $TrainAlgae$RF
## mse.test.mean 
##      246.9343

#bmrkdata <- getBMRPerformances(bench, as.df = TRUE)
plotBMRBoxplots(bench, measure = mse)+aes(fill=learner.id)+coord_flip()+theme_bw()

The RF performs better than other models

5. Feature importance

Now, i decide to choose RF as final model. But at first, i need to know the contribution of each variable to predict the a1 algae. After that, i retain only 5 most important variables and see whether the model is improved

fv = generateFilterValuesData(train.task, method = c("cforest.importance"))
plotFilterValues(fv)

## Keep the 5 most important features
filtered.task = filterFeatures(train.task, method = "cforest.importance", abs = 5)

#Fuse
lrn <-  makeFilterWrapper(learner = "regr.ranger", fw.method = "cforest.importance", fw.abs = 5)
rdesc = makeResampleDesc("CV", iters = 20)
r = resample(learner = lrn, task = train.task, resampling = rdesc, show.info = FALSE, models = TRUE)
r$aggr

## mse.test.mean 
##      248.3011

The model is not improved. However, the difference of MSE is weak. I decided to retain only these 5 variables into the final model because we can optimize the time computing.

6. Final Model

df2 <- select(df_imp,PO4,oPO4,Cl,size,mxPH,a1)
df3 <- select(df1_imp,PO4,oPO4,Cl,size,mxPH,a1)
learner_rf <- makeLearner("regr.ranger", id = "RF")

train.task = makeRegrTask(id = "TrainAlgae", data = df2, target = "a1")
test.task = makeRegrTask(id = "TestAlgae", data = df3, target = "a1")

set.seed(1111)
rf <- train(learner_rf,train.task)
rf_pred <- predict(rf, task = test.task)
res <- rf_pred$data
cor(res$truth,res$response)^2

## [1] 0.5344292

The RF model can explain 53 % of variance. It’s not a good prediction however i think we cannot expect much better because biological proccesses are very complex. More parameters and measurements might be necessary to have a good prediction

Predict_algae_blooms. Practice with mlr package

Vịt Trần

January 9, 2017