Data Mining Case Study: Algal Blooms - Linear Models
Author
Dr Andrew Dalby
Background
In any scientific investigation there are often things that are easy to measure and things that are hard to measure. What we want to do is use the easy measurements to make predictions about the hard ones.
An example is the Iris data collected by Fisher. There are different species of Iris, but to identify which is which you need a plant expert. Fisher collected measurements of the lengths and widths of the flower parts to try to automate the classification without needing an expert. This is often used as a classic example of data mining: classifying the Irises into the three different species based on these measurements.
The first case study in the Data Mining with R textbook is about algal blooms in rivers. Here the easy-to-measure data is the chemical monitoring of the river. The data you want to predict - the hard-to-collect data - is the biological data on the algal blooms.
Algal Bloom Dataset
The data were collected as part of the ERUDIT research network, have been used in data analysis competitions and are part of the UCI Machine Learning Repository. The training set consists of 200 samples and the testing set contains 140 samples. Each observation contains 11 variables. Three are nominal and describe the season of collection, the river size and the river speed. The other eight are chemical parameters:
Maximum pH value
Minimum value of oxygen
Mean value of chloride
Mean value of nitrates
Mean value of ammonium
Mean value of orthophosphate
Mean value of total phosphate
Mean of chlorophyll
For the 200 training set data-points there are also the frequencies of 7 different harmful algal species.
In this case, unlike the Iris data where the output is a classification, the data requires a predictive regression model. You could consider that the variables are merely correlated, but the argument is that these chemical levels contribute to the algal blooms and so are part of their cause.
library("DMwR2")
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library("car")
Loading required package: carData
library("ggplot2")
library("grid")
library("Hmisc")
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, units
Whenever you have a dataset it is important to summarise the data and carry out some exploratory data analysis before you start any model building.
summary(algae)
season size speed mxPH mnO2
autumn:40 large :45 high :84 Min. :5.600 Min. : 1.500
spring:53 medium:84 low :33 1st Qu.:7.700 1st Qu.: 7.725
summer:45 small :71 medium:83 Median :8.060 Median : 9.800
winter:62 Mean :8.012 Mean : 9.118
3rd Qu.:8.400 3rd Qu.:10.800
Max. :9.700 Max. :13.400
NA's :1 NA's :2
Cl NO3 NH4 oPO4
Min. : 0.222 Min. : 0.050 Min. : 5.00 Min. : 1.00
1st Qu.: 10.981 1st Qu.: 1.296 1st Qu.: 38.33 1st Qu.: 15.70
Median : 32.730 Median : 2.675 Median : 103.17 Median : 40.15
Mean : 43.636 Mean : 3.282 Mean : 501.30 Mean : 73.59
3rd Qu.: 57.824 3rd Qu.: 4.446 3rd Qu.: 226.95 3rd Qu.: 99.33
Max. :391.500 Max. :45.650 Max. :24064.00 Max. :564.60
NA's :10 NA's :2 NA's :2 NA's :2
PO4 Chla a1 a2
Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000
1st Qu.: 41.38 1st Qu.: 2.000 1st Qu.: 1.50 1st Qu.: 0.000
Median :103.29 Median : 5.475 Median : 6.95 Median : 3.000
Mean :137.88 Mean : 13.971 Mean :16.92 Mean : 7.458
3rd Qu.:213.75 3rd Qu.: 18.308 3rd Qu.:24.80 3rd Qu.:11.375
Max. :771.60 Max. :110.456 Max. :89.80 Max. :72.600
NA's :2 NA's :12
a3 a4 a5 a6
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 1.550 Median : 0.000 Median : 1.900 Median : 0.000
Mean : 4.309 Mean : 1.992 Mean : 5.064 Mean : 5.964
3rd Qu.: 4.925 3rd Qu.: 2.400 3rd Qu.: 7.500 3rd Qu.: 6.925
Max. :42.800 Max. :44.600 Max. :44.400 Max. :77.600
a7
Min. : 0.000
1st Qu.: 0.000
Median : 1.000
Mean : 2.495
3rd Qu.: 2.400
Max. :31.600
The original book used base R for the visualisation, but you can do more with ggplot2, which was used for the second edition.
qqPlot(algae$mxPH,main='Normal QQ plot of maximum pH',ylab="")
[1] 56 57
We can also examine the outputs as well as the predictive variables in the same way. In this case we use boxplots and violin plots to see whether the amount of algae A1 differs between the different river sizes.
ggplot(algae, aes(x=size, y=a1))+geom_boxplot()+labs(x="River Size", y="Amount of Algae A1")
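The violin plot mentioned above can be produced in the same way; a minimal sketch, not necessarily the exact code used for the figure:
ggplot(algae, aes(x=size, y=a1))+geom_violin()+labs(x="River Size", y="Amount of Algae A1")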
River size does not seem to have a simple relationship to the amount of algae A1. Small rivers show fewer very low values, but this might be because small rivers have a faster flow rate.
When you have a large number of continuous variables and you think there might be an association with a categorical variable that contributes to the response variable, you can stratify the continuous variable into levels containing an equal number of cases and then plot the results in a strip plot. In this case, if you think that the minimum amount of oxygen produces different patterns of algae A1 blooms between the seasons, you would produce the following plot.
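One way to build this plot is to cut mnO2 into four groups with roughly equal counts using cut2 from Hmisc and then facet a jittered strip plot; a minimal sketch under those assumptions (the column name mnO2_band and the data frame algae_strip are mine):
algae_strip <- algae[!is.na(algae$mnO2), ]
algae_strip$mnO2_band <- cut2(algae_strip$mnO2, g=4)
ggplot(algae_strip, aes(x=a1, y=season))+geom_jitter(height=0.2, alpha=0.6)+facet_wrap(~mnO2_band)+labs(x="Amount of Algae A1", y="Season")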
These results suggest that there is no seasonal pattern in the relationship between minimum oxygen levels and the amount of algae A1, except at the lowest oxygen levels, where there are fewer cases in the autumn.
Another useful tool for summary plots is the GGally library.
library("GGally")
ggpairs(algae, columns=1:5)
ggpairs(algae, columns=4:11)
Scatterplots
As we are going to construct a linear model it also makes sense to plot the scatterplots for the variables.
ggplot(data=algae, aes(x=mxPH, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Maximum pH") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=mnO2, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Minimum Oxygen") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=PO4, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Phosphate") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=Cl, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Chloride") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=NO3, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Nitrate") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=Chla, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Chlorophyll") +ylab("Amount of Algae A1")
From the plots we learn the following:
There are outliers in the Chloride and Nitrate measurements that should be removed.
These are probably the result of a chemical spill or another specific event in the river system.
The maximum pH shows strong seasonality but most of the other variables don’t.
For now I am not going to remove those data points, as I am going to follow the book in carrying out the regression, but I will repeat the process later with the outliers removed.
Quality Control and Imputation
If you look at the data there are examples of missing values. For quality control you want to remove all of the rows that contain a large proportion of missing data points, because if multiple variables are missing for a sample, imputation methods will give poor results.
clean_algae <- algae[-manyNAs(algae), ]
There are still missing values, which will affect some machine learning techniques (in this case, multiple regression models). These missing points can be filled in using imputation methods.
Personally I am not a very big fan of imputation methods especially when you are going to use the result for model training and data-mining which is quite sensitive to the training data. My old Professor would have called this garbage in and garbage out, as he was very conservative about methods. If you start a project with bad data then no matter how sophisticated your methods you still cannot produce a good result. There is even a proverb about this - You cannot make a silk purse out of a sow’s ear. This suggests that over-optimistic use of resources is not something new.
If you do use imputation, the k-nearest-neighbour method is probably the least objectionable. I also prefer medians over means on robustness grounds; with data like these, which are far from normal, medians will give better performance.
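A minimal sketch of median imputation along these lines (the data frame name algae_imputed is mine; DMwR2 also provides knnImputation for the k-nearest-neighbour approach). Each chemical column in the cleaned data has its missing values replaced by that column's median:
algae_imputed <- clean_algae
for (col in names(algae_imputed)[4:11]) {
  med <- median(algae_imputed[[col]], na.rm=TRUE)
  algae_imputed[[col]][is.na(algae_imputed[[col]])] <- med
}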
The imputed data can then be used for multiple linear regression. Linear regression is used because we need a model that will predict the values of the algal blooms for new sets of chemical readings. Creating linear models in R is simple, but it is the assessment of those models that is critical.
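A minimal sketch of the initial model for algae a1, using all eleven predictors from the imputed data (columns 1 to 12 hold the predictors and a1; the name model1 is mine):
model1 <- lm(a1~., data=algae_imputed[,1:12])
summary(model1)
anova(model1)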
This model is not very good. From the ANOVA table, season is clearly not significant and can be removed and the model updated. We can then repeat the process, removing variables that make negligible contributions, until we get a final model. This can be done automatically in R using backward elimination.
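Backward elimination can be run automatically with stepAIC from the MASS package; a minimal sketch starting from the full model above (final.lm is my name for the result):
library("MASS")
final.lm <- stepAIC(model1, direction="backward")
summary(final.lm)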
The performance of the models is evaluated using the AIC (Akaike Information Criterion). The final model contains far fewer variables than the initial model, which was specified using everything. However, the summary of the final model is not very impressive.
My result is different from that in the book because I used a different method of imputation (medians rather than means). These fitting methods are very sensitive to changes in the initial data, and that is a general issue with data mining that you need to be aware of and test for.
In the textbook the author's main suggestion for the poor fit is deviations from normality. I think the other major issue is that these variables are only a small part of the explanation: there are no measured biological variables. If you fit models that include all of the algal types then the model has an R-squared of 0.5. There are interactions between the algal species which are not captured by the chemical measurements. An algal bloom will not happen just because the chemistry of the water fits the requirements; it depends on there being some algae present to start with, and then on sunlight, temperature and the biology of each species. This is a point the author discusses in the chapter on data mining: often the limitations of the data arise because the person analysing the data was not involved in the design of the experiment, which results in poor sampling or missing variables.
Correlation and Collinearity
In the book the author uses correlations between the variables as a possible imputation scheme. However, correlations are also important for detecting collinearity, as it makes sense to exclude very strongly correlated variables from the regression because they effectively contain the same information.
Collinearity is more important when there are a large number of explanatory variables and a small number of cases as over-fitting and the challenge of the “curse of dimensionality” are significant issues. In this case there are still a large number of cases (200) compared to the number of explanatory variables (11).
library("corrplot")
cm <- cor(algae[,4:18], use="complete.obs")
corrplot(cm, type="upper", tl.pos="d", tl.cex=0.75)
corrplot(cm, add=TRUE, type="lower", method="number", tl.cex=0.75, diag=FALSE, tl.pos="n", cl.pos="n")
In this case you want to use PO4 or oPO4 but not both, and possibly NO3 or NH4 but not both. First you can check this by building models without each of the different types of phosphate.
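A minimal sketch of this check, refitting the full model with one variable dropped at a time and comparing the adjusted R-squared values (the model names are mine):
model_no_PO4 <- update(model1, . ~ . - PO4)
model_no_oPO4 <- update(model1, . ~ . - oPO4)
model_no_NH4 <- update(model1, . ~ . - NH4)
model_no_NO3 <- update(model1, . ~ . - NO3)
sapply(list(no_PO4=model_no_PO4, no_oPO4=model_no_oPO4, no_NH4=model_no_NH4, no_NO3=model_no_NO3), function(m) summary(m)$adj.r.squared)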
Removing either NH4 or NO3 makes the model slightly worse (especially removing NO3), which suggests that the correlation is not strong enough to make it necessary to remove either variable.
By considering the correlations you get a better understanding of why the backward elimination model does not retain both oPO4 and PO4, and why NO3 is significant at the 0.05 level but NH4 is not.
Forward and Backward Elimination
I wanted to double-check that the backward elimination had not become trapped in a local minimum, as the ANOVA table shows NH4 to be insignificant in the model. I computed the model using the stepwise (forward and backward) elimination method, starting from model 3, which has oPO4 and NH4 eliminated.
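A minimal sketch of that stepwise search (model3 and final3.lm are my names; model3 is the full model with oPO4 and NH4 dropped):
model3 <- update(model1, . ~ . - oPO4 - NH4)
final3.lm <- stepAIC(model3, direction="both")
summary(final3.lm)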
This yields a simpler model than before that no longer contains NH4 or mnO2 but now contains Cl, at the cost of 0.0029 in the adjusted R-squared, but with a higher F-statistic because of the smaller number of model degrees of freedom. The difference between the models is, from my perspective, tiny and you could use either. But this raises questions about the stability of the models and about which variables matter.
Dealing with the Outliers
This analysis did not remove the outliers, so I am going to repeat the process with them removed. First I want to look at the plots again with the outliers removed and with the imputed data points included.
This has a better adjusted R-squared than the original model, so it was definitely correct to remove those outliers. Also, models 6 and 7 are the same, but models 1 and 4 are different even though they were created by the same process.
However, there are some questions about the imputation. Should I impute with the outliers in and then remove them, or remove them before imputing the data? I used the median, which should not be affected by these outliers, but it is always best to try it out just to check what is happening.
Imputation After Removing Outliers
In theory this should not make any difference to the models produced.
The models are different from those before: chlorophyll has been added and maximum pH removed, and this reduces the adjusted R-squared. There are also 10 fewer cases in the dataset. The problem is that the filter did not only remove the single outliers, which is what we wanted; it removed all the rows with NA values as well. This means we need another way of removing the cases that we do not want, which are rows 133 and 152.
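A minimal sketch of removing them by row index instead (a subset()-style logical filter also drops any row where the condition evaluates to NA, which would explain the extra missing cases; the name algae_trimmed is mine):
algae_trimmed <- algae[-c(133, 152), ]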
This worked in the way we expected and gave the result we expected. The model is very similar to the one with imputation before outlier removal.
An Estimate of Model Performance
In this case we are trying to predict values, so we can compare the predictions with the observed data, as we do with all linear regression models: you see how well the predicted values fit the values you used to train the model. This uses the predict function in R. We can then summarise the residuals as a mean absolute error (MAE), a mean squared error (MSE) or a normalised mean squared error (NMSE).
One of the weaknesses of the linear model is that it predicts negative values for the amount of algae, which are not possible. You can improve the performance of the model by excluding impossible values. Constraints on the data are a very important part of effective data mining and need to be considered from the start.
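A minimal sketch of this evaluation for one of the models, using predict, clamping impossible negative predictions to zero with pmax, and computing the three error measures (the names are mine; the NMSE is normalised against simply predicting the mean):
preds <- pmax(predict(final.lm, algae_imputed), 0)
obs <- algae_imputed$a1
mae <- mean(abs(preds - obs))
mse <- mean((preds - obs)^2)
nmse <- mse / mean((mean(obs) - obs)^2)
The table below gives these measures for three of the models.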
# A tibble: 3 × 4
model mae mse nmse
<dbl> <dbl> <dbl> <dbl>
1 4 12.5 286. 0.627
2 7 12.4 285. 0.622
3 11 12.4 285. 0.622
Model 4 is the first reduced model taking into account collinearity but without the outliers removed. Model 7 has the outliers removed after imputation. Model 11 has the outliers removed before imputation.
All of the models include only PO4, NO3, Cl, mxPH and the size levels medium and small. None of the other variables contribute to the models.
A serious concern for anyone who understands the chemical and biological background is the direction of the model coefficients. They are all negative, with a high intercept (an algal bloom is the default). This is not what you would expect or hope for: you expect increased chemical concentrations to contribute to the blooms, but the models all have negative slopes. This means there are serious issues with the models and they are all missing some essential variables.
These models only explain about a third of the variability in algal amounts, and the other two-thirds appears to lie in other, unmeasured variables.
If you look at the density of the algal amounts, most of the density is at 0 but with a long tail. This suggests a sporadic process that is not simply a response to the predictive variables. The correlations show that the algal species are mostly competitive, except species 5 and 6, which have a cooperative relationship.
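The density shape described above can be checked with a quick plot; a minimal sketch for algae a1:
ggplot(algae, aes(x=a1))+geom_density()+labs(x="Amount of Algae A1", y="Density")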
The results for species a7 are terrible if they are based on the chemical predictors, with a very slight improvement if you include the correlated algal species a1. This tells you that the data collected have no explanatory value for algal species a7; there is a failure to capture any relevant measurements.
model1.a7 <- lm(a7~., data=algae_imputed4[,c(1:11,18)])
final1.lm.a7 <- stepAIC(model1.a7, direction="both")
model2.a7 <- lm(a7~., data=algae_imputed4[,c(1:18)])
final2.lm.a7 <- stepAIC(model2.a7, direction="both")
As I said above, it is a serious concern that the intercepts are positive and the slopes are negative for the algae. This is a reversal of the expected cause and effect. It indicates that you have high levels of nitrates and phosphates when there are no algal blooms and lower levels during blooms, which would suggest that the blooms are consuming all of the available nutrients when they occur. This is possible, but not useful: you cannot use it for prediction because you are seeing the consequences/effects and not the causes.
If this is true, can we use the data from the algal blooms to predict the amounts of phosphate and nitrate? The answer is yes, and more effectively than we can predict the algal numbers. The other variables - chloride, pH and so on - are not going to be affected by the blooms and can be left in the models.
cm <- cor(algae_imputed4[,4:18], use="complete.obs")
corrplot(cm, type="upper", tl.pos="d", tl.cex=0.75)
corrplot(cm, add=TRUE, type="lower", method="number", tl.cex=0.75, diag=FALSE, tl.pos="n", cl.pos="n")
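A minimal sketch of one of these reversed models, predicting PO4 from the algal frequencies plus the variables the blooms cannot affect (the model name and the exact predictor set are my choice, not necessarily those behind the results described below):
model.PO4 <- lm(PO4 ~ season + size + speed + mxPH + mnO2 + Cl + a1 + a2 + a3 + a4 + a5 + a6 + a7, data=algae_imputed4)
summary(model.PO4)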
The adjusted R-squared values and the ANOVA tables are much better for these predictions than they were for the algal predictions, and in this case there are only four variables to predict. The results also make biological sense in terms of river speed, oxygen levels and the accumulation of nitrate and phosphate.
From these models you can get an understanding of the critical factors for algal blooms. They occur at lower concentrations in smaller and slower rivers. This is what we would expect. You can also look at the intercept where there are no algae and determine the levels of phosphate and nitrate that a river can safely support before an algal bloom is likely.
You could also generate a new version of the model where the algal levels are turned into a binary variable - there is a bloom or there isn't - and this might be a more useful tool and produce better models for prediction and fitting.
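A minimal sketch of that idea using logistic regression (the threshold of 5 for calling a bloom is an arbitrary illustration, and the names are mine):
algae_imputed4$bloom_a1 <- as.integer(algae_imputed4$a1 > 5)
bloom.glm <- glm(bloom_a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + PO4, data=algae_imputed4, family=binomial)
summary(bloom.glm)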
Key Lessons
Know your data; data out of context cannot produce reliable insights.
Ask the right questions of the model: what should the model predict?
Think about causes and effects.
The iterative design of data collection is essential to collect all the data that you need.
Never underestimate the amount of human input necessary for successful machine learning. You cannot find the right model if it is not within the data.