Data Mining Case Study: Algal Blooms - Linear Models
Author
Dr Andrew Dalby
Background
In any scientific investigation there are often things that are easy to measure and things that are hard to measure. What we want to do is use the easy measurements to make predictions about the hard ones.
An example is the Iris data collected by Fisher. There are different species of Iris, but to identify which is which you need a plant expert. Fisher collected measurements of the lengths and widths of the flower parts to try to automate the classification without needing an expert. This is often used as a classic example of data mining: classifying the Irises into the three different species based on these measurements.
The first case study in the Data Mining with R textbook is about algal blooms in rivers. Here the easy-to-measure data is the chemical monitoring of the river. The data you want to predict - the hard-to-collect data - is the biological data on the algal blooms.
Algal Bloom Dataset
The data were collected as part of the ERUDIT research network, have been used in data analysis competitions and are part of the UCI Machine Learning Repository. The training set consists of 200 samples and the testing set contains 140 samples. Each observation contains 11 variables. Three are nominal and describe the season of collection, the river size and the river speed. The other eight are chemical parameters:
Maximum pH value
Minimum value of oxygen
Mean value of chloride
Mean value of nitrates
Mean value of ammonium
Mean value of orthophosphate
Mean value of total phosphate
Mean of chlorophyll
For the 200 training set data-points there are also the frequencies of 7 different harmful algal species.
In this case, unlike the Iris data where the output is a classification, the data requires a predictive regression model. You could consider that the variables are merely correlated, but the argument is that these chemical levels contribute to the algal blooms and so are part of their cause.
library("DMwR2")
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library("car")
Loading required package: carData
library("ggplot2")
library("grid")
library("Hmisc")
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, units
Whenever you have a dataset it is important to summarise the data and carry out some exploratory data analysis before you start any model building.
summary(algae)
season size speed mxPH mnO2
autumn:40 large :45 high :84 Min. :5.600 Min. : 1.500
spring:53 medium:84 low :33 1st Qu.:7.700 1st Qu.: 7.725
summer:45 small :71 medium:83 Median :8.060 Median : 9.800
winter:62 Mean :8.012 Mean : 9.118
3rd Qu.:8.400 3rd Qu.:10.800
Max. :9.700 Max. :13.400
NA's :1 NA's :2
Cl NO3 NH4 oPO4
Min. : 0.222 Min. : 0.050 Min. : 5.00 Min. : 1.00
1st Qu.: 10.981 1st Qu.: 1.296 1st Qu.: 38.33 1st Qu.: 15.70
Median : 32.730 Median : 2.675 Median : 103.17 Median : 40.15
Mean : 43.636 Mean : 3.282 Mean : 501.30 Mean : 73.59
3rd Qu.: 57.824 3rd Qu.: 4.446 3rd Qu.: 226.95 3rd Qu.: 99.33
Max. :391.500 Max. :45.650 Max. :24064.00 Max. :564.60
NA's :10 NA's :2 NA's :2 NA's :2
PO4 Chla a1 a2
Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000
1st Qu.: 41.38 1st Qu.: 2.000 1st Qu.: 1.50 1st Qu.: 0.000
Median :103.29 Median : 5.475 Median : 6.95 Median : 3.000
Mean :137.88 Mean : 13.971 Mean :16.92 Mean : 7.458
3rd Qu.:213.75 3rd Qu.: 18.308 3rd Qu.:24.80 3rd Qu.:11.375
Max. :771.60 Max. :110.456 Max. :89.80 Max. :72.600
NA's :2 NA's :12
a3 a4 a5 a6
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 1.550 Median : 0.000 Median : 1.900 Median : 0.000
Mean : 4.309 Mean : 1.992 Mean : 5.064 Mean : 5.964
3rd Qu.: 4.925 3rd Qu.: 2.400 3rd Qu.: 7.500 3rd Qu.: 6.925
Max. :42.800 Max. :44.600 Max. :44.400 Max. :77.600
a7
Min. : 0.000
1st Qu.: 0.000
Median : 1.000
Mean : 2.495
3rd Qu.: 2.400
Max. :31.600
The original book used base R for the visualisation, but you can do more with ggplot2, which was used for the second edition.
qqPlot(algae$mxPH,main='Normal QQ plot of maximum pH',ylab="")
[1] 56 57
We can also examine the outputs as well as the predictive variables in the same way. In this case we use boxplots and violin plots to see whether the amount of algae A1 differs between the different river sizes.
ggplot(algae, aes(x=size, y=a1))+geom_boxplot()+labs(x="River Size", y="Amount of Algae A1")
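The violin plot mentioned above can be produced in the same way; a minimal sketch, not necessarily the exact code used for the figure:
ggplot(algae, aes(x=size, y=a1))+geom_violin()+labs(x="River Size", y="Amount of Algae A1")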
River size does not seem to have a simple relationship to the amount of algae A1. Small rivers show fewer very low values, but this might be because small rivers have a faster flow rate.
When you have a large number of continuous variables and you think there might be an association with a categorical variable that contributes to the response variable, you can stratify the continuous variable into levels containing an equal number of cases and then plot the results in a strip plot. In this case, if you think that the minimum amount of oxygen produces different patterns of algae A1 blooms between the seasons, you would produce the following plot.
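One way to build this plot is to cut mnO2 into four groups with roughly equal counts using cut2 from Hmisc and then facet a jittered strip plot; a minimal sketch under those assumptions (the column name mnO2_band and the data frame algae_strip are mine):
algae_strip <- algae[!is.na(algae$mnO2), ]
algae_strip$mnO2_band <- cut2(algae_strip$mnO2, g=4)
ggplot(algae_strip, aes(x=a1, y=season))+geom_jitter(height=0.2, alpha=0.6)+facet_wrap(~mnO2_band)+labs(x="Amount of Algae A1", y="Season")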
These results suggest that there is no seasonal pattern in the relationship between minimum oxygen levels and the amount of algae A1, except at the lowest oxygen levels, where there are fewer cases in the autumn.
Another useful tool for summary plots is the GGally library.
library("GGally")
ggpairs(algae, columns=1:5)
ggpairs(algae, columns=4:11)
Scatterplots
As we are going to construct a linear model it also makes sense to plot the scatterplots for the variables.
ggplot(data=algae, aes(x=mxPH, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Maximum pH") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=mnO2, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Minimum Oxygen") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=PO4, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Phosphate") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=Cl, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Chloride") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=NO3, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Nitrate") +ylab("Amount of Algae A1")
ggplot(data=algae, aes(x=Chla, y=a1, color=season))+geom_point()+geom_smooth(method="lm") +xlab("Chlorophyll") +ylab("Amount of Algae A1")
From the plots we learn the following:
There are outliers in the Chloride and Nitrate measurements that should be removed.
These are probably the result of a chemical spill or another specific event in the river system.
The maximum pH shows strong seasonality but most of the other variables don’t.
For now I am not going to remove those data points, as I am going to follow the book in carrying out the regression, but I will repeat the process later with the outliers removed.
Quality Control and Imputation
If you look at the data there are examples of missing values. For quality control you want to remove all of the rows that contain a large proportion of missing data points, because if multiple variables are missing for a sample, imputation methods will give poor results.
clean_algae <- algae[-manyNAs(algae), ]
There are still missing values, which will affect some machine learning techniques (in this case, multiple regression models). These missing points can be filled in using imputation methods.
Personally I am not a very big fan of imputation methods especially when you are going to use the result for model training and data-mining which is quite sensitive to the training data. My old Professor would have called this garbage in and garbage out, as he was very conservative about methods. If you start a project with bad data then no matter how sophisticated your methods you still cannot produce a good result. There is even a proverb about this - You cannot make a silk purse out of a sow’s ear. This suggests that over-optimistic use of resources is not something new.
If you do use imputation, the k-nearest-neighbour method is probably the least objectionable. I also prefer medians over means on robustness grounds; with data like these, which are far from normal, medians will give better performance.
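A minimal sketch of median imputation along these lines (the data frame name algae_imputed is mine; DMwR2 also provides knnImputation for the k-nearest-neighbour approach). Each chemical column in the cleaned data has its missing values replaced by that column's median:
algae_imputed <- clean_algae
for (col in names(algae_imputed)[4:11]) {
  med <- median(algae_imputed[[col]], na.rm=TRUE)
  algae_imputed[[col]][is.na(algae_imputed[[col]])] <- med
}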
The imputed data can then be used for multiple linear regression. Linear regression is used because we need a model that will predict the values of the algal blooms for new sets of chemical readings. Creating linear models in R is simple, but it is the assessment of those models that is critical.
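A minimal sketch of the initial model for algae a1, using all eleven predictors from the imputed data (columns 1 to 12 hold the predictors and a1; the name model1 is mine):
model1 <- lm(a1~., data=algae_imputed[,1:12])
summary(model1)
anova(model1)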
This model is not very good. From the ANOVA table, season is clearly not significant and can be removed and the model updated. We can then repeat the process, removing variables that make negligible contributions, until we get a final model. This can be done automatically in R using backward elimination.
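Backward elimination can be run automatically with stepAIC from the MASS package; a minimal sketch starting from the full model above (final.lm is my name for the result):
library("MASS")
final.lm <- stepAIC(model1, direction="backward")
summary(final.lm)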
The performance of the models is evaluated using the AIC (Akaike Information Criterion). The final model contains far fewer variables than the initial model, which was specified using everything. However, the summary of the final model is not very impressive.
My result is different from that in the book because I used a different method of imputation (medians rather than means). These fitting methods are very sensitive to changes in the initial data, and that is a general issue with data mining that you need to be aware of and test for.
In the textbook the author's main suggestion for the poor fit is deviations from normality. I think the other major issue is that these variables are only a small part of the explanation: there are no measured biological variables. If you fit models that include all of the algal types then the model has an R-squared of 0.5. There are interactions between the algal species which are not captured by the chemical measurements. An algal bloom will not happen just because the chemistry of the water fits the requirements; it depends on there being some algae present to start with, and then on sunlight, temperature and the biology of each species. This is a point the author discusses in the chapter on data mining: often the limitations of the data arise because the person analysing the data was not involved in the design of the experiment, which results in poor sampling or missing variables.
Correlation and Collinearity
In the book the author uses correlations between the variables as a possible imputation scheme. However, correlations are also important for detecting collinearity, as it makes sense to exclude very strongly correlated variables from the regression because they effectively contain the same information.
Collinearity is more important when there are a large number of explanatory variables and a small number of cases as over-fitting and the challenge of the “curse of dimensionality” are significant issues. In this case there are still a large number of cases (200) compared to the number of explanatory variables (11).
library("corrplot")
cm <- cor(algae[,4:18], use="complete.obs")
corrplot(cm, type="upper", tl.pos="d", tl.cex=0.75)
corrplot(cm, add=TRUE, type="lower", method="number", tl.cex=0.75, diag=FALSE, tl.pos="n", cl.pos="n")
In this case you want to use PO4 or oPO4 but not both, and possibly NO3 or NH4 but not both. First you can check this by building models without each of the different types of phosphate.
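A minimal sketch of this check, refitting the full model with one variable dropped at a time and comparing the adjusted R-squared values (the model names are mine):
model_no_PO4 <- update(model1, . ~ . - PO4)
model_no_oPO4 <- update(model1, . ~ . - oPO4)
model_no_NH4 <- update(model1, . ~ . - NH4)
model_no_NO3 <- update(model1, . ~ . - NO3)
sapply(list(no_PO4=model_no_PO4, no_oPO4=model_no_oPO4, no_NH4=model_no_NH4, no_NO3=model_no_NO3), function(m) summary(m)$adj.r.squared)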
Removing either NH4 or NO3 makes the model slightly worse (especially removing NO3), which suggests that the correlation is not strong enough to make it necessary to remove either variable.
By considering the correlations you get a better understanding of why the backward elimination model does not retain both oPO4 and PO4, and why NO3 is significant at the 0.05 level but NH4 is not.
Forward and Backward Elimination
I wanted to double-check that the backward elimination had not become trapped in a local minimum, as the ANOVA table shows NH4 to be insignificant in the model. I computed the model using the stepwise (forward and backward) elimination method, starting from model 3, which has oPO4 and NH4 eliminated.
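A minimal sketch of that stepwise search (model3 and final3.lm are my names; model3 is the full model with oPO4 and NH4 dropped):
model3 <- update(model1, . ~ . - oPO4 - NH4)
final3.lm <- stepAIC(model3, direction="both")
summary(final3.lm)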
This yields a simpler model than before that no longer contains NH4 or mnO2 but now contains Cl, at the cost of 0.0029 in the adjusted R-squared, but with a higher F-statistic because of the smaller number of model degrees of freedom. The difference between the models is, from my perspective, tiny and you could use either. But this raises questions about the stability of the models and about which variables matter.
Dealing with the Outliers
This analysis did not remove the outliers, so I am going to repeat the process with them removed. First I want to look at the plots again with the outliers removed and with the imputed data points included.
This has a better adjusted R-squared than the original model, so it was definitely correct to remove those outliers. Also, models 6 and 7 are the same, but models 1 and 4 are different even though they were created by the same process.
However, there are some questions about the imputation. Should I impute with the outliers in and then remove them, or remove them before imputing the data? I used the median, which should not be affected by these outliers, but it is always best to try it out just to check what is happening.
Imputation After Removing Outliers
In theory this should not make any difference to the models produced.
The models are different from those before: chlorophyll has been added and maximum pH removed, and this reduces the adjusted R-squared. There are also 10 fewer cases in the dataset. The problem is that the filter did not only remove the single outliers, which is what we wanted; it removed all the rows with NA values as well. This means we need another way of removing the cases that we do not want, which are rows 133 and 152.
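A minimal sketch of removing them by row index instead (a subset()-style logical filter also drops any row where the condition evaluates to NA, which would explain the extra missing cases; the name algae_trimmed is mine):
algae_trimmed <- algae[-c(133, 152), ]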
This worked in the way we expected and gave the result we expected. The model is very similar to the one with imputation before outlier removal.
An Estimate of Model Performance
In this case we are trying to predict values, so we can compare the predictions with the observed data, as we do with all linear regression models: you see how well the predicted values fit the values you used to train the model. This uses the predict function in R. We can then summarise the residuals as a mean absolute error (MAE), a mean squared error (MSE) or a normalised mean squared error (NMSE).
One of the weaknesses of the linear model is that it predicts negative values for the amount of algae, which are not possible. You can improve the performance of the model by excluding impossible values. Constraints on the data are a very important part of effective data mining and need to be considered from the start.
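A minimal sketch of this evaluation for one of the models, using predict, clamping impossible negative predictions to zero with pmax, and computing the three error measures (the names are mine; the NMSE is normalised against simply predicting the mean):
preds <- pmax(predict(final.lm, algae_imputed), 0)
obs <- algae_imputed$a1
mae <- mean(abs(preds - obs))
mse <- mean((preds - obs)^2)
nmse <- mse / mean((mean(obs) - obs)^2)
The table below gives these measures for three of the models.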
# A tibble: 3 × 4
model mae mse nmse
<dbl> <dbl> <dbl> <dbl>
1 4 12.5 286. 0.627
2 7 12.4 285. 0.622
3 11 12.4 285. 0.622
Model 4 is the first reduced model taking into account collinearity but without the outliers removed. Model 7 has the outliers removed after imputation. Model 11 has the outliers removed before imputation.
All of the models include only PO4, NO3, Cl, mxPH and the size levels medium and small. None of the other variables contribute to the models.
A serious concern for anyone who understands the chemical and biological background is the direction of the model coefficients. They are all negative, with a high intercept (an algal bloom is the default). This is not what you would expect or hope for: you expect increased chemical concentrations to contribute to the blooms, but the models all have negative slopes. This means there are serious issues with the models and they are all missing some essential variables.
These models only explain about a third of the variability in algal amounts, and the other two-thirds appears to lie in other, unmeasured variables.
If you look at the density of the algal amounts, most of the density is at 0 but with a long tail. This suggests a sporadic process that is not simply a response to the predictive variables. The correlations show that the algal species are mostly competitive, except species 5 and 6, which have a cooperative relationship.
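The density shape described above can be checked with a quick plot; a minimal sketch for algae a1:
ggplot(algae, aes(x=a1))+geom_density()+labs(x="Amount of Algae A1", y="Density")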
The results for species a7 are terrible if they are based on the chemical predictors, with a very slight improvement if you include the correlated algal species a1. This tells you that the data collected have no explanatory value for algal species a7; there is a failure to capture any relevant measurements.
model1.a7 <- lm(a7~., data=algae_imputed4[,c(1:11,18)])
final1.lm.a7 <- stepAIC(model1.a7, direction="both")
model2.a7 <- lm(a7~., data=algae_imputed4[,c(1:18)])
final2.lm.a7 <- stepAIC(model2.a7, direction="both")
As I said above, it is a serious concern that the intercepts are positive and the slopes are negative for the algae. This is a reversal of the expected cause and effect. It indicates that you have high levels of nitrates and phosphates when there are no algal blooms and lower levels during blooms, which would suggest that the blooms are consuming all of the available nutrients when they occur. This is possible, but not useful: you cannot use it for prediction because you are seeing the consequences/effects and not the causes.
If this is true, can we use the data from the algal blooms to predict the amounts of phosphate and nitrate? The answer is yes, and more effectively than we can predict the algal numbers. The other variables - chloride, pH and so on - are not going to be affected by the blooms and can be left in the models.
cm <- cor(algae_imputed4[,4:18], use="complete.obs")
corrplot(cm, type="upper", tl.pos="d", tl.cex=0.75)
corrplot(cm, add=TRUE, type="lower", method="number", tl.cex=0.75, diag=FALSE, tl.pos="n", cl.pos="n")
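A minimal sketch of one of these reversed models, predicting PO4 from the algal frequencies plus the variables the blooms cannot affect (the model name and the exact predictor set are my choice, not necessarily those behind the results described below):
model.PO4 <- lm(PO4 ~ season + size + speed + mxPH + mnO2 + Cl + a1 + a2 + a3 + a4 + a5 + a6 + a7, data=algae_imputed4)
summary(model.PO4)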
The adjusted R-squared values and the ANOVA tables are much better for these predictions than they were for the algal predictions, and in this case there are only four variables to predict. The results also make biological sense in terms of river speed, oxygen levels and the accumulation of nitrate and phosphate.
From these models you can get an understanding of the critical factors for algal blooms. They occur at lower concentrations in smaller and slower rivers. This is what we would expect. You can also look at the intercept where there are no algae and determine the levels of phosphate and nitrate that a river can safely support before an algal bloom is likely.
You could also generate a new version of the model where the algal levels are turned into a binary variable - there is a bloom or there isn't - and this might be a more useful tool and produce better models for prediction and fitting.
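A minimal sketch of that idea using logistic regression (the threshold of 5 for calling a bloom is an arbitrary illustration, and the names are mine):
algae_imputed4$bloom_a1 <- as.integer(algae_imputed4$a1 > 5)
bloom.glm <- glm(bloom_a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + PO4, data=algae_imputed4, family=binomial)
summary(bloom.glm)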
Key Lessons
Know your data; data out of context cannot produce reliable insights.
Ask the right questions of the model: what should the model predict?
Think about causes and effects.
The iterative design of data collection is essential to collect all the data that you need.
Never underestimate the amount of human input necessary for successful machine learning. You cannot find the right model if it is not within the data.