The goal of this analysis is to see if we can predict the concentration of chlorophyll a (one of the major pigments involved in photosynthesis) in a lake. Chlorophyll a is a good indicator of the amount of phytoplankton in a lake or pond, and phytoplankton are the organisms that are sometimes responsible for harmful algal blooms. We’ll use a number of predictor variables that are commonly collected in surveys of lakes and ponds to try to predict chlorophyll a.
Predictors: Secchi depth, pH, conductivity, total phosphorus, surface water temperature, surface dissolved oxygen, maximum depth, lake area, shoreline development index, and day of year.
We’ll use a machine learning approach called a Random Forest. This method is particularly good for situations where there are non-linear relationships in the data or you don’t want to assume a particular probability model.
We’ll be using data from the U.S. Environmental Protection Agency (EPA) National Lake Assessment. This data set includes over 1000 surveys of lakes and ponds across the US in 2007. A more detailed description of the data, along with direct URLs for downloading, is available on the EPA’s website.
First, let’s get some data. We’ll use the package RCurl to download the raw .csv data directly from the EPA’s website. We’ll download data from three different data sets to get all of the variables that we want.
library(RCurl)
## Loading required package: bitops
# this contains pH, conductivity, total phosphorus, secchi depth, and chlorophyll a
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_WaterQuality_20091123.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_water_qual <- read.csv(textConnection(X))
# this contains lake size, shoreline development index, and maximum depth
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_SampledLakeInformation_20091113.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_lake_info <- read.csv(textConnection(X))
# this contains dissolved oxygen and temperature
# NOTE: this is the entire depth profile (i.e., multiple readings for each water body surveyed)
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_Profile_20091008.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_profile <- read.csv(textConnection(X))
rm(list=c("X","URL")) # clean up your workspace
Before we start analyzing anything, let’s check out some features of the data. First, let’s look at all of the variable names.
names(data_water_qual)
## [1] "SITE_ID" "YEAR" "VISIT_NO"
## [4] "SITE_TYPE" "LAKE_SAMP" "SAMPLED"
## [7] "DATE_COL" "SAMPLE_CATEGORY" "DATECHEM"
## [10] "SAMPLED_CHEM" "INDXSAMP_CHEM" "PH_FIELD"
## [13] "PH_FIELD_DEPTH" "SAMPLE_DEPTH" "SAMPLE_ID_CHEM"
## [16] "LAB_ID_CHEM" "PH_LAB" "PHLAB_FLAG"
## [19] "COND" "COND_FLAG" "COND_RL_ALERT"
## [22] "ANC" "ANC_FLAG" "TURB"
## [25] "TURB_FLAG" "TURB_RL_ALERT" "TOC"
## [28] "TOC_FLAG" "TOC_RL_ALERT" "DOC"
## [31] "DOC_FLAG" "DOC_RL_ALERT" "NH4N_PPM"
## [34] "NH4" "NH4_FLAG" "NH4_RL_ALERT"
## [37] "NO3_NO2" "NO3NO2_FLAG" "NO3NO2_RL_ALERT"
## [40] "NTL_PPM" "NTL" "NTL_FLAG"
## [43] "NTL_RL_ALERT" "PTL" "PTL_FLAG"
## [46] "PTL_RL_ALERT" "CL_PPM" "CL"
## [49] "CL_FLAG" "CL_RL_ALERT" "NO3N_PPM"
## [52] "NO3" "NO3_FLAG" "NO3_RL_ALERT"
## [55] "SO4_PPM" "SO4" "SO4_FLAG"
## [58] "SO4_RL_ALERT" "CA_PPM" "CA"
## [61] "CA_FLAG" "CA_RL_ALERT" "MG_PPM"
## [64] "MG" "MG_FLAG" "MG_RL_ALERT"
## [67] "NA_PPM" "NA." "NA_FLAG"
## [70] "NA_RL_ALERT" "K_PPM" "K"
## [73] "K_FLAG" "K_RL_ALERT" "COLOR"
## [76] "COLOR_FLAG" "COLOR_RL_ALERT" "SIO2"
## [79] "SIO2_FLAG" "SIO2_RL_ALERT" "H"
## [82] "OH" "NH4ION" "CATSUM"
## [85] "ANSUM2" "ANDEF2" "SOBC"
## [88] "BALANCE2" "ORGION" "CONCAL2"
## [91] "CONDHO2" "DAY_SHIP" "DAYSHIP_ALERT"
## [94] "DATE_RECEIVED" "DATE_FILTERED" "DATEFILT_ALERT"
## [97] "PHLAB_HT" "PHLAB_HT_ALERT" "COND_HT"
## [100] "COND_HT_ALERT" "ANC_HT" "ANC_HT_ALERT"
## [103] "TURB_HT" "TURB_HT_ALERT" "TOC_HT"
## [106] "TOC_HT_ALERT" "DOC_HT" "DOC_HT_ALERT"
## [109] "NH4_HT" "NH4_HT_ALERT" "NO3NO2_HT"
## [112] "NO3NO2_HT_ALERT" "NTL_HT" "NTL_HT_ALERT"
## [115] "PTL_HT" "PTL_HT_ALERT" "CL_HT"
## [118] "CL_HT_ALERT" "NO3_HT" "NO3_HT_ALERT"
## [121] "SO4_HT" "SO4_HT_ALERT" "CA_HT"
## [124] "CA_HT_ALERT" "MG_HT" "MG_HT_ALERT"
## [127] "NA_HT" "NA_HT_ALERT" "K_HT"
## [130] "K_HT_ALERT" "COLOR_HT" "COLOR_HT_ALERT"
## [133] "SIO2_HT" "SIO2_HT_ALERT" "COM_FIELD_PH"
## [136] "COMMENT_FLD_CHEM" "COMMENT_LAB_CHEM" "COMMENT_IM_CHEM"
## [139] "INDXSAMP_CHLA" "DATECHLA" "SAMPLED_CHLA"
## [142] "CHLA" "SAMPLE_ID_CHLA" "LAB_ID_CHLA"
## [145] "CHLA_RL_ALERT" "FLAG_FLD_CHLA" "COMMENT_FLD_CHLA"
## [148] "FLAG_LAB_CHLA" "COMMENT_LAB_CHLA" "CHLA_HT_ALERT"
## [151] "CHLA_HT" "DATESECCHI" "SAMPLED_SECCHI"
## [154] "SECMEAN" "CLEAR_TO_BOTTOM" "COMMENT_SECCHI"
## [157] "VISIT_ID" "FLAG_SECCHI"
names(data_lake_info)
## [1] "SITE_ID" "VISIT_NO" "SAMPLED"
## [4] "DATE_COL" "REPEAT" "SITE_TYPE"
## [7] "LAKE_SAMP" "TNT" "LON_DD"
## [10] "LAT_DD" "ALBERS_X" "ALBERS_Y"
## [13] "FLD_LON_DD" "FLD_LAT_DD" "FLD_SRC"
## [16] "FLD_FLAG" "ST" "STATE_NAME"
## [19] "CNTYNAME" "EPA_REG" "NHDNAME"
## [22] "LAKENAME" "AREA_CAT7" "NESLAKE"
## [25] "NESLAKE_ID" "STRATUM" "PANEL"
## [28] "DSGN_CAT" "MDCATY" "WGT"
## [31] "WGT_NLA" "ADJWGT_CAT" "URBAN"
## [34] "WSA_ECO3" "WSA_ECO9" "ECO_LEV_3"
## [37] "ECO_L3_NAM" "NUT_REG" "NUTREG_NAME"
## [40] "ECO_NUTA" "LAKE_ORIGIN" "ECO3_X_ORIGIN"
## [43] "REF_CLUSTER" "REFCLUS_NAME" "RT_NLA"
## [46] "REF_NUTR" "AREA_HA" "SIZE_CLASS"
## [49] "LAKEAREA" "LAKEPERIM" "SLD"
## [52] "DEPTH_X" "DEPTHMAX" "ELEV_PT"
## [55] "HUC_2" "HUC_8" "REACHCODE"
## [58] "COM_ID" "INDEX_SAMP" "STATUS_VER"
## [61] "STATUS_FLD" "STATUS_DSK" "PERM_WATER"
## [64] "NON_SALINE" "SRFC_AREA" "METER_DEEP"
## [67] "OPEN_WATER" "AQUACULTUR" "DISPOSAL"
## [70] "SEWAGE" "EVAPORATE" "PHYS_ACCES"
## [73] "FLAG_INFO" "COMMENT_INFO" "SAMPLED_PROFILE"
## [76] "SAMPLED_SECCHI" "SAMPLED_ASSESS" "SAMPLED_PHAB"
## [79] "INDXSAMP_PHAB" "SAMPLED_CHEM" "INDXSAMP_CHEM"
## [82] "SAMPLED_CHLA" "INDXSAMP_CHLA" "SAMPLED_ZOOP"
## [85] "INDXSAMP_ZOOP" "SAMPLED_PHYT" "INDXSAMP_PHYT"
## [88] "SAMPLED_CORE" "INDXSAMP_CORE" "SAMPLED_INF"
## [91] "INDXSAMP_INF" "SAMPLED_ENTE" "INDXSAMP_ENTE"
## [94] "SAMPLED_MICR" "INDXSAMP_MICR" "SAMPLED_SDHG"
## [97] "INDXSAMP_SDHG" "VISIT_ID" "FID_1"
names(data_profile)
## [1] "SITE_ID" "YEAR" "VISIT_NO"
## [4] "DATE_PROFILE" "SAMPLED_PROFILE" "DEPTH"
## [7] "METALIMNION" "TEMP_FIELD" "DO_FIELD"
## [10] "PH_FIELD" "COND_FIELD" "FLAG_PROFILE"
## [13] "COMMENT_PROFILE"
You can see that there are actually a lot of possible predictors in these data sets. For now, we’ll limit our models to some of the most common variables collected in lake surveys.
Let’s also look at the sizes of these data sets:
dim(data_water_qual)
## [1] 1326 158
dim(data_lake_info)
## [1] 1252 99
dim(data_profile)
## [1] 12670 13
Now, let’s look at our response variable: chlorophyll a.
par(mfrow=c(1,2))
hist(data_water_qual$CHLA)
hist(log(data_water_qual$CHLA))
summary(data_water_qual$CHLA)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.067 2.867 7.700 30.120 26.350 936.000 5
It looks like there are some pretty extreme values of chlorophyll a: the raw distribution is heavily right-skewed (note that the mean is roughly four times the median), which is why the log-transformed histogram looks much more symmetric.
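To get a feel for just how heavy that right tail is, here’s a quick sketch (the cutoff of 100 is arbitrary, just for illustration):
# upper quantiles of the raw chlorophyll a distribution
quantile(data_water_qual$CHLA, probs = c(0.5, 0.9, 0.99), na.rm = TRUE)
# how many surveys exceed an arbitrary high cutoff of 100?
sum(data_water_qual$CHLA > 100, na.rm = TRUE)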
Now we’re going to combine all of those data sets into one. First, we need to make a unique identifier for the lake and survey date.
data_water_qual$ID <- paste(data_water_qual$SITE_ID, data_water_qual$DATE_COL, sep="_")
data_lake_info$ID <- paste(data_lake_info$SITE_ID, data_lake_info$DATE_COL, sep="_")
data_profile$ID <- paste(data_profile$SITE_ID, data_profile$DATE_PROFILE, sep="_")
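It’s worth confirming that these IDs actually identify a single row in each of the first two data frames. A quick sketch (0 means no duplicates):
sum(duplicated(data_water_qual$ID)) # expect 0: one water-quality record per survey
sum(duplicated(data_lake_info$ID))  # expect 0: one lake-info record per survey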
If you inspected data_profile more closely, you would notice that it has an observation for every depth, so let’s subset to just the surface measurements. This way we have a single observation of dissolved oxygen and temperature for each lake.
data_profile <- subset(data_profile, DEPTH == 0) # keep only the surface (0 m) readings
Now, let’s merge the data frames based on the column ID.
data_merge <- merge(data_water_qual, data_lake_info, by=c("ID")) # merge the first two
data_merge <- merge(data_merge, data_profile, by=c("ID")) # then merge the result to the third
Let’s see how many observations we end up with:
dim(data_merge)
## [1] 1253 271
Remove any observations that are missing CHLA:
data_merge <- data_merge[!is.na(data_merge$CHLA), ] # keep only rows with a chlorophyll a measurement
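As a quick sanity check, confirm that no missing chlorophyll a values remain:
sum(is.na(data_merge$CHLA)) # should now be 0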
Determine the day of the year (sometimes called the Julian day). strftime() is a function to convert date formats. We want to go from month/day/year to the day of the year so that it can be a neat and tidy predictor in our model. (Note that the column is named DATE_COL.x because the merge added .x and .y suffixes to columns that appeared in more than one data frame.)
data_merge$DATE <- strftime(as.Date(data_merge$DATE_COL.x, "%m/%d/%Y"), format = "%j")
data_merge$DATE <- as.numeric(data_merge$DATE)
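As a quick check that the conversion worked, the day-of-year values should fall between 1 and 366:
range(data_merge$DATE, na.rm = TRUE)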
We’re going to train the model on a portion of the data and then test it on the remainder. Use the createDataPartition() function from the package caret. Use the argument p to split the data into 75% of the observations for training the model and the rest for testing.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y=data_merge$CHLA, p=0.75, list=FALSE) # use 75% of the data to train the model
training <- data_merge[inTrain,]
testing <- data_merge[-inTrain,]
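To verify the split worked as intended, a quick sketch comparing the row counts:
nrow(training) / nrow(data_merge) # should be roughly 0.75
nrow(testing) / nrow(data_merge)  # should be roughly 0.25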
Before we get into a more complicated analysis, let’s check out some of the relationships between each of the single predictors and chlorophyll a.
library(ggplot2)
predict_vars <- c("SECMEAN","PH_LAB","COND","PTL","TEMP_FIELD","DO_FIELD","DEPTHMAX","LAKEAREA","SLD","DATE")
for (i in predict_vars) {
  plots <- ggplot(training, aes_string(y = "CHLA", x = i)) + geom_point(alpha = 0.4)
  plots <- plots + scale_x_log10() + scale_y_log10() # log-log scales to handle the skew
  print(plots)
}
## Warning: Removed 105 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).
It looks like some of the predictors have a pretty linear relationship with chlorophyll a on these log-log scales. Those will probably be good predictors.
Now we’re actually going to fit a Random Forest. We’ll use the package caret again. caret is a nice wrapper package for many machine learning techniques. There are lots of other packages for random forests, and caret will actually call randomForest directly.
library(caret)
set.seed(12321)
modFit_RF <- train(CHLA ~ SECMEAN + PH_LAB + COND + PTL + TEMP_FIELD + DO_FIELD + DEPTHMAX +
LAKEAREA + SLD + DATE,
method="rf",
data=training)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
finMod_RF <- modFit_RF$finalModel
print(modFit_RF)
## Random Forest
##
## 938 samples
## 271 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 799, 799, 799, 799, 799, 799, ...
##
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 48.24719 0.5047018 8.591854 0.07093048
## 6 50.65210 0.4713491 7.577021 0.08100372
## 10 52.33093 0.4520417 8.178208 0.08681661
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
We just tuned the model to pick the optimal value of the parameter mtry, which is the number of variables randomly sampled as candidates at each split in a tree. We did this by bootstrapping our data (i.e., resampling with replacement), trying different values for mtry, and finding the optimal value by minimizing the RMSE (root mean square error, a measure of accuracy).
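If you want more control over that tuning, caret lets you swap the resampling scheme and supply your own grid of candidate mtry values via trainControl() and tuneGrid. Here’s a sketch using 10-fold cross-validation instead of the default bootstrap (the grid values are just examples):
ctrl <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
grid <- expand.grid(mtry = c(2, 4, 6, 8, 10))    # candidate values of mtry to try
modFit_RF_cv <- train(CHLA ~ SECMEAN + PH_LAB + COND + PTL + TEMP_FIELD + DO_FIELD +
                      DEPTHMAX + LAKEAREA + SLD + DATE,
                      method="rf",
                      data=training,
                      trControl=ctrl,
                      tuneGrid=grid)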
Let’s look at the results from the final model:
print(finMod_RF)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 2597.869
## % Var explained: 51.48
And here are the most important variables:
finMod_RF$importance
## IncNodePurity
## SECMEAN 943640.1
## PH_LAB 458455.8
## COND 301595.2
## PTL 695665.2
## TEMP_FIELD 194740.3
## DO_FIELD 376788.9
## DEPTHMAX 435968.7
## LAKEAREA 169286.3
## SLD 201833.5
## DATE 156315.0
As expected, water clarity (average Secchi depth, or SECMEAN) and nutrients (total phosphorus, or PTL) are important predictors of chlorophyll a. It wasn’t obvious from the bivariate plots, but it looks like dissolved oxygen (DO_FIELD) is also an important predictor.
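If you’d rather see the importance scores graphically, the randomForest package provides varImpPlot(), which works directly on the final model object. A minimal sketch:
varImpPlot(finMod_RF, main = "Variable importance (increase in node purity)")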
Let’s see how well the Random Forest did. First, we need to remove the test cases that have missing predictor values:
cols <- c("SECMEAN","PH_LAB","COND","PTL","TEMP_FIELD","DO_FIELD","DEPTHMAX","LAKEAREA","SLD","DATE")
testing <- testing[!rowSums(is.na(testing[cols])), ]
Make predictions and add them to the testing data frame:
testing$pred_RF <- predict(modFit_RF, testing)
Calculate the root mean square error (RMSE) on the testing data:
sqrt(mean((testing$pred_RF - testing$CHLA)^2))
Plot the predictions against the observations:
plot_RF_log <- ggplot(testing, aes(x=CHLA, y=pred_RF)) + geom_point(size=2)
plot_RF_log <- plot_RF_log + geom_abline(slope=1,intercept=0,size=1,colour="red")
plot_RF_log <- plot_RF_log + xlab("Observed chl a") + ylab("Predicted chl a")
plot_RF_log <- plot_RF_log + theme_classic(base_size=18)
plot_RF_log <- plot_RF_log + expand_limits(x=0, y=0)
plot_RF_log <- plot_RF_log + scale_x_log10(expand=c(0,0)) + scale_y_log10(expand=c(0,0))
plot_RF_log
Not bad. Not great. Most observations fall along the 1:1 line in red, which is where predictions exactly match observations. Notice that there are definitely some cases where our predictions are way off from the observed values. It also looks like there is a tendency for the random forest to predict values that are higher than observed.
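One way to quantify that tendency is the mean bias on the test set; a positive value means the forest over-predicts on average. A quick sketch:
mean(testing$pred_RF - testing$CHLA) # positive = over-prediction on average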
While this data set is not huge (>1000 observations), fitting Random Forests on large data sets can be slow. Therefore, it’s often helpful to fit your model using all of the cores on your machine (by default R will only use one core).
There are several packages that allow for parallel processing in R, including doSNOW, doMC, foreach, and many others. For this example, I will use doSNOW.
library(caret)
library(doSNOW)
cl <- makeCluster(4, type="SOCK") # set up a cluster with 4 workers
registerDoSNOW(cl)
# fit the model
modFit_RF_par <- train(CHLA ~ SECMEAN + PH_LAB + COND + PTL + TEMP_FIELD + DO_FIELD +
DEPTHMAX + LAKEAREA + SLD + DATE,
method="rf",
data=training)
stopCluster(cl)
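If you’re not sure how many cores to request, the parallel package (bundled with R) can report what’s available. A minimal sketch:
library(parallel)
detectCores() # number of logical cores; use this (or fewer) in makeCluster()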
While 1200 or so observations might be pretty big for ecology, data sets can get a lot bigger! R doesn’t necessarily do a great job of dealing with huge data sets (e.g., 100,000s of rows). Two common packages used to jump this hurdle are bigmemory and ff. They both allow large data sets to be stored on disk, rather than in RAM.
Here’s what it would look like if you wanted to repeat the data import step with ff:
library(ff)
URL <- "http://water.epa.gov/type/lakes/assessmonitor/lakessurvey/upload/NLA2007_WaterQuality_20091123.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
data_water_qual <- read.table.ffdf(file=textConnection(X), FUN="read.csv")
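For comparison, here’s a sketch of the analogous step with bigmemory, with two caveats: read.big.matrix() reads from a file on disk rather than a connection, so we first write the downloaded text out; and a big.matrix stores a single numeric type, so non-numeric columns (e.g., SITE_ID) would be coerced to NA, meaning this approach only really pays off once the data are reduced to numeric columns.
library(bigmemory)
writeLines(X, "NLA2007_WaterQuality.csv") # read.big.matrix() needs a file on disk
data_water_qual_big <- read.big.matrix("NLA2007_WaterQuality.csv",
                                       header = TRUE, sep = ",",
                                       backingfile = "NLA2007_WaterQuality.bin",
                                       descriptorfile = "NLA2007_WaterQuality.desc")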