The goal of this analysis is to see if we can predict the concentration of chlorophyll a (one of the major pigments involved in photosynthesis) in a lake. Chlorophyll a is a good indicator of the amount of phytoplankton in a lake or pond. These are the organisms that are sometimes responsible for a harmful algal bloom. We’ll use a number of predictor variables that are commonly collected in surveys of lakes and ponds to try to predict chlorophyll a.

Predictors:

Lake size
Shoreline development index (complexity of the shoreline)
Depth, maximum
Day of the year (Julian)
Total phosphorus
Conductivity
Alkalinity
pH
Secchi depth (i.e., water clarity)
Dissolved oxygen
Temperature

We’ll use a machine learning approach called a Random Forest. This method is particularly good for situations where there are non-linear relationships in the data or you don’t want to assume a particular probability model.

We’ll be using data from the U.S. Environmental Protection Agency (EPA) National Lake Assessment. This data set includes over 1000 surveys of lakes and ponds across the US in 2007. A more detailed description of the data and direct URLs for downloading can be found here.

Download & import data

First, let’s get some data. We’ll use the package RCurl to download the raw .csv data directly from the EPA’s website. We’ll download data from three different data sets to get all of the variables that we want.

library(RCurl)

## Loading required package: bitops

# this contains pH, conductivity, total phosphorus, secchi depth, and chlorophyll a
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_WaterQuality_20091123.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_water_qual <- read.csv(textConnection(X))

# this contains lake size, shoreline development index, and maximum depth 
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_SampledLakeInformation_20091113.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_lake_info <- read.csv(textConnection(X))

# this contains dissolved oxygen and temperature 
# NOTE: this is the entire depth profile (i.e., multiple readings for each water body surveyed) 
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_Profile_20091008.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_profile <- read.csv(textConnection(X))

rm(list=c("X","URL")) # clean up your workspace

Examine characteristics of the data

Before we start analyzing anything, let’s check out some features of the data. First, let’s look at all of the variable names.

names(data_water_qual)

##   [1] "SITE_ID"          "YEAR"             "VISIT_NO"        
##   [4] "SITE_TYPE"        "LAKE_SAMP"        "SAMPLED"         
##   [7] "DATE_COL"         "SAMPLE_CATEGORY"  "DATECHEM"        
##  [10] "SAMPLED_CHEM"     "INDXSAMP_CHEM"    "PH_FIELD"        
##  [13] "PH_FIELD_DEPTH"   "SAMPLE_DEPTH"     "SAMPLE_ID_CHEM"  
##  [16] "LAB_ID_CHEM"      "PH_LAB"           "PHLAB_FLAG"      
##  [19] "COND"             "COND_FLAG"        "COND_RL_ALERT"   
##  [22] "ANC"              "ANC_FLAG"         "TURB"            
##  [25] "TURB_FLAG"        "TURB_RL_ALERT"    "TOC"             
##  [28] "TOC_FLAG"         "TOC_RL_ALERT"     "DOC"             
##  [31] "DOC_FLAG"         "DOC_RL_ALERT"     "NH4N_PPM"        
##  [34] "NH4"              "NH4_FLAG"         "NH4_RL_ALERT"    
##  [37] "NO3_NO2"          "NO3NO2_FLAG"      "NO3NO2_RL_ALERT" 
##  [40] "NTL_PPM"          "NTL"              "NTL_FLAG"        
##  [43] "NTL_RL_ALERT"     "PTL"              "PTL_FLAG"        
##  [46] "PTL_RL_ALERT"     "CL_PPM"           "CL"              
##  [49] "CL_FLAG"          "CL_RL_ALERT"      "NO3N_PPM"        
##  [52] "NO3"              "NO3_FLAG"         "NO3_RL_ALERT"    
##  [55] "SO4_PPM"          "SO4"              "SO4_FLAG"        
##  [58] "SO4_RL_ALERT"     "CA_PPM"           "CA"              
##  [61] "CA_FLAG"          "CA_RL_ALERT"      "MG_PPM"          
##  [64] "MG"               "MG_FLAG"          "MG_RL_ALERT"     
##  [67] "NA_PPM"           "NA."              "NA_FLAG"         
##  [70] "NA_RL_ALERT"      "K_PPM"            "K"               
##  [73] "K_FLAG"           "K_RL_ALERT"       "COLOR"           
##  [76] "COLOR_FLAG"       "COLOR_RL_ALERT"   "SIO2"            
##  [79] "SIO2_FLAG"        "SIO2_RL_ALERT"    "H"               
##  [82] "OH"               "NH4ION"           "CATSUM"          
##  [85] "ANSUM2"           "ANDEF2"           "SOBC"            
##  [88] "BALANCE2"         "ORGION"           "CONCAL2"         
##  [91] "CONDHO2"          "DAY_SHIP"         "DAYSHIP_ALERT"   
##  [94] "DATE_RECEIVED"    "DATE_FILTERED"    "DATEFILT_ALERT"  
##  [97] "PHLAB_HT"         "PHLAB_HT_ALERT"   "COND_HT"         
## [100] "COND_HT_ALERT"    "ANC_HT"           "ANC_HT_ALERT"    
## [103] "TURB_HT"          "TURB_HT_ALERT"    "TOC_HT"          
## [106] "TOC_HT_ALERT"     "DOC_HT"           "DOC_HT_ALERT"    
## [109] "NH4_HT"           "NH4_HT_ALERT"     "NO3NO2_HT"       
## [112] "NO3NO2_HT_ALERT"  "NTL_HT"           "NTL_HT_ALERT"    
## [115] "PTL_HT"           "PTL_HT_ALERT"     "CL_HT"           
## [118] "CL_HT_ALERT"      "NO3_HT"           "NO3_HT_ALERT"    
## [121] "SO4_HT"           "SO4_HT_ALERT"     "CA_HT"           
## [124] "CA_HT_ALERT"      "MG_HT"            "MG_HT_ALERT"     
## [127] "NA_HT"            "NA_HT_ALERT"      "K_HT"            
## [130] "K_HT_ALERT"       "COLOR_HT"         "COLOR_HT_ALERT"  
## [133] "SIO2_HT"          "SIO2_HT_ALERT"    "COM_FIELD_PH"    
## [136] "COMMENT_FLD_CHEM" "COMMENT_LAB_CHEM" "COMMENT_IM_CHEM" 
## [139] "INDXSAMP_CHLA"    "DATECHLA"         "SAMPLED_CHLA"    
## [142] "CHLA"             "SAMPLE_ID_CHLA"   "LAB_ID_CHLA"     
## [145] "CHLA_RL_ALERT"    "FLAG_FLD_CHLA"    "COMMENT_FLD_CHLA"
## [148] "FLAG_LAB_CHLA"    "COMMENT_LAB_CHLA" "CHLA_HT_ALERT"   
## [151] "CHLA_HT"          "DATESECCHI"       "SAMPLED_SECCHI"  
## [154] "SECMEAN"          "CLEAR_TO_BOTTOM"  "COMMENT_SECCHI"  
## [157] "VISIT_ID"         "FLAG_SECCHI"

names(data_lake_info)

##  [1] "SITE_ID"         "VISIT_NO"        "SAMPLED"        
##  [4] "DATE_COL"        "REPEAT"          "SITE_TYPE"      
##  [7] "LAKE_SAMP"       "TNT"             "LON_DD"         
## [10] "LAT_DD"          "ALBERS_X"        "ALBERS_Y"       
## [13] "FLD_LON_DD"      "FLD_LAT_DD"      "FLD_SRC"        
## [16] "FLD_FLAG"        "ST"              "STATE_NAME"     
## [19] "CNTYNAME"        "EPA_REG"         "NHDNAME"        
## [22] "LAKENAME"        "AREA_CAT7"       "NESLAKE"        
## [25] "NESLAKE_ID"      "STRATUM"         "PANEL"          
## [28] "DSGN_CAT"        "MDCATY"          "WGT"            
## [31] "WGT_NLA"         "ADJWGT_CAT"      "URBAN"          
## [34] "WSA_ECO3"        "WSA_ECO9"        "ECO_LEV_3"      
## [37] "ECO_L3_NAM"      "NUT_REG"         "NUTREG_NAME"    
## [40] "ECO_NUTA"        "LAKE_ORIGIN"     "ECO3_X_ORIGIN"  
## [43] "REF_CLUSTER"     "REFCLUS_NAME"    "RT_NLA"         
## [46] "REF_NUTR"        "AREA_HA"         "SIZE_CLASS"     
## [49] "LAKEAREA"        "LAKEPERIM"       "SLD"            
## [52] "DEPTH_X"         "DEPTHMAX"        "ELEV_PT"        
## [55] "HUC_2"           "HUC_8"           "REACHCODE"      
## [58] "COM_ID"          "INDEX_SAMP"      "STATUS_VER"     
## [61] "STATUS_FLD"      "STATUS_DSK"      "PERM_WATER"     
## [64] "NON_SALINE"      "SRFC_AREA"       "METER_DEEP"     
## [67] "OPEN_WATER"      "AQUACULTUR"      "DISPOSAL"       
## [70] "SEWAGE"          "EVAPORATE"       "PHYS_ACCES"     
## [73] "FLAG_INFO"       "COMMENT_INFO"    "SAMPLED_PROFILE"
## [76] "SAMPLED_SECCHI"  "SAMPLED_ASSESS"  "SAMPLED_PHAB"   
## [79] "INDXSAMP_PHAB"   "SAMPLED_CHEM"    "INDXSAMP_CHEM"  
## [82] "SAMPLED_CHLA"    "INDXSAMP_CHLA"   "SAMPLED_ZOOP"   
## [85] "INDXSAMP_ZOOP"   "SAMPLED_PHYT"    "INDXSAMP_PHYT"  
## [88] "SAMPLED_CORE"    "INDXSAMP_CORE"   "SAMPLED_INF"    
## [91] "INDXSAMP_INF"    "SAMPLED_ENTE"    "INDXSAMP_ENTE"  
## [94] "SAMPLED_MICR"    "INDXSAMP_MICR"   "SAMPLED_SDHG"   
## [97] "INDXSAMP_SDHG"   "VISIT_ID"        "FID_1"

names(data_profile)

##  [1] "SITE_ID"         "YEAR"            "VISIT_NO"       
##  [4] "DATE_PROFILE"    "SAMPLED_PROFILE" "DEPTH"          
##  [7] "METALIMNION"     "TEMP_FIELD"      "DO_FIELD"       
## [10] "PH_FIELD"        "COND_FIELD"      "FLAG_PROFILE"   
## [13] "COMMENT_PROFILE"

You can see there actually are a lot of possible predictors in this data set. We’ll limit our models for now to some of the most common variables that are collected in lake surveys.

Let’s also look at the size of these data sets

dim(data_water_qual)

## [1] 1326  158

dim(data_lake_info)

## [1] 1252   99

dim(data_profile)

## [1] 12670    13

Now, let’s look at our response variable: chlorophylla a.

par(mfrow=c(1,2))
hist(data_water_qual$CHLA)
hist(log(data_water_qual$CHLA))

summary(data_water_qual$CHLA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.067   2.867   7.700  30.120  26.350 936.000       5

It looks like there are some pretty extreme values of chlorophyll a.

Merge data frames

Now we’re going to combine all of those data sets into one. First, we need to make a unique identifier for the lake and survey date.

data_water_qual$ID <- paste(data_water_qual$SITE_ID, data_water_qual$DATE_COL, sep="_")
data_lake_info$ID <- paste(data_lake_info$SITE_ID, data_lake_info$DATE_COL, sep="_")
data_profile$ID <- paste(data_profile$SITE_ID, data_profile$DATE_PROFILE, sep="_")

If you inspected data_profile closer, you would notice that it has an observation for every depth so let’s subset to just get the surface measurements. This way we have just a single observation of dissolved oxygen and temperature for each lake.

data_profile <- subset(data_profile, data_profile$DEPTH ==0)

Now, let’s merge data frames based on the column ID.

data_merge <- merge(data_water_qual, data_lake_info, by=c("ID")) # merge the first two 
data_merge <- merge(data_merge, data_profile, by=c("ID")) # then merge the result to the third

Let’s see how many observations we end up with:

dim(data_merge)

## [1] 1253  271

Clean up a few things

Remove any observations that are missing CHLA

data_merge <- data_merge[!rowSums(is.na(data_merge["CHLA"])), ]

Determine the Julian day of the year. strftime() is a function to convert date formats. We want to go from month/day/year to the day of the year, this way it can be a neat and tidy predictor in our model.

data_merge$DATE <- strftime(as.Date(data_merge$DATE_COL.x, "%m/%d/%Y"), format = "%j")
data_merge$DATE <- as.numeric(data_merge$DATE)

Split data into training and testing sets

We’re going to train the model on a portion of the data and then test it on the remainder. Use the createDataPartition() function the package caret. Use the argument p to Split the data into 75% of the observations for training the model and the rest for testing.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

inTrain <- createDataPartition(y=data_merge$CHLA, p=0.75, list=FALSE) # use 75% of data to train to the model 
training <- data_merge[inTrain,]
testing <- data_merge[-inTrain,]

Plotting univariate relationships

Before we get into a more complicated analyis let’s check out some of the relationships between each of the single predictors and chlorophyll a.

library(ggplot2)

predict_vars <- c("SECMEAN","PH_LAB","COND","PTL","TEMP_FIELD","DO_FIELD","DEPTHMAX","LAKEAREA","SLD","DATE")

for(i in unique(predict_vars)){
  plots <- ggplot(training,aes_string(y="CHLA",x=i)) + geom_point(alpha=0.4)
  plots <- plots + scale_x_log10() + scale_y_log10()
  print(plots)
}

## Warning: Removed 105 rows containing missing values (geom_point).

## Warning: Removed 41 rows containing missing values (geom_point).

It looks like some of the predictors have a pretty linear relationship with chlorophyll a. Those will probably be good predictors.

Fit a random forest

Now we’re actually going to fit a Random Forest. We’ll use the package caret again. caret is a nice wrapper package for many machine learning techniques. There are lots of othe packages for random forests and caret will actually call randomForest directly.

library(caret)

set.seed(12321)

modFit_RF <- train(CHLA ~ SECMEAN + PH_LAB + COND + PTL + TEMP_FIELD + DO_FIELD + DEPTHMAX + 
                     LAKEAREA + SLD + DATE, 
                   method="rf", 
                   data=training)

## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

finMod_RF <- modFit_RF$finalModel
print(modFit_RF)

## Random Forest 
## 
## 938 samples
## 271 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 799, 799, 799, 799, 799, 799, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   RMSE SD   Rsquared SD
##    2    48.24719  0.5047018  8.591854  0.07093048 
##    6    50.65210  0.4713491  7.577021  0.08100372 
##   10    52.33093  0.4520417  8.178208  0.08681661 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 2.

We just tuned the model to pick the optimal parametr for mtry which is the number of variables randomly sampled to use at each split in the tree. We did this by bootstraping our data (i.e., resampling with replacement), trying different values for mtry, and finding the optimal value by minimizing the RMSE (root mean square error, a measure of accuracy).

Let’s look at the results from the final model:

print(finMod_RF)

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 2597.869
##                     % Var explained: 51.48

And here are the most important variables:

finMod_RF$importance

##            IncNodePurity
## SECMEAN         943640.1
## PH_LAB          458455.8
## COND            301595.2
## PTL             695665.2
## TEMP_FIELD      194740.3
## DO_FIELD        376788.9
## DEPTHMAX        435968.7
## LAKEAREA        169286.3
## SLD             201833.5
## DATE            156315.0

As expected, the water clarity (average secchi depth or SECMEAN) and nutrients (total phosphorus or PTL) are important predictors of chlorophyll a. It wasn’t obvious from the bivariate plots, but it looks like dissolved oxygen (DO_FIELD) is also an important predictor.

Observed vs. predicted chlorophyll a

Let’s see how well the Random Forest did. First, we need to remove the test cases that have missing predictors

cols <- c("SECMEAN","PH_LAB","COND","PTL","TEMP_FIELD","DO_FIELD","DEPTHMAX","LAKEAREA","SLD","DATE")
testing <- testing[!rowSums(is.na(testing[cols])), ]

Make predictions and add them to to the testing data frame

testing$pred_RF <- predict(modFit_RF, testing)

Calculate the root mean square error (RMSE) on the testing data

sqrt(sum((testing$pred_RF - testing$CHLA)^2))

## [1] 799.5624

Plot it

plot_RF_log <- ggplot(testing, aes(x=CHLA, y=pred_RF)) + geom_point(size=2) 
plot_RF_log <- plot_RF_log + geom_abline(slope=1,intercept=0,size=1,colour="red")
plot_RF_log <- plot_RF_log + xlab("Observed chl a") + ylab("Predicted chl a")
plot_RF_log <- plot_RF_log + theme_classic(base_size=18)
plot_RF_log <- plot_RF_log + expand_limits(x=0, y=0)
plot_RF_log <- plot_RF_log + scale_x_log10(expand=c(0,0)) + scale_y_log10(expand=c(0,0)) 
plot_RF_log

Not bad. Not great. Most observations are along the 1:1 line in red which would mean the predictions match the observations. Notice, there definitely are some cases where our predictions are way off from observed values. It also looks like there is a tendency for the random forest to predict values that are higher than observed.

Parallel Processing

While this data set is not huge (>1000 observations), fitting Random Forests on large data sets can be slow. Therefore, it’s often helpful to fit your model using all of the cores on your machine (by default R will only use one core).

There are several packages which allow for parallel processing in R. These include doSNOW, doMC, foreach, and many others. For this example, I will use doSNOW.

library(caret)
library(doSNOW)

cl <- makeCluster(4, type="SOCK") # set-up the cluster 
registerDoSNOW(cl)

# fit the model 
modFit_RF_par <- train(CHLA ~ SECMEAN + PH_LAB + COND + PTL + TEMP_FIELD + DO_FIELD + 
                         DEPTHMAX + LAKEAREA + SLD + DATE, 
                   method="rf", 
                   data=training)

stopCluster(cl)

Even bigger data

While 1200 or so observations might be pretty big for ecology, data sets can get a lot bigger! R doesn’t necessarily do a great job at dealing with huge data sets (e.g., 100,000s of rows). Two common packages used to jump this hurdle are bigmemory and ff. They both allow large datasets to be stored on the disk, rather than in RAM.

Here’s what it would look like if you wanted to repeat the data inport step with ff:

library(ff)

URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_WaterQuality_20091123.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_water_qual <- read.table.ffdf(file=textConnection(X), FUN="read.csv")

Machine Learning & Lakes