MyRWA Water Quality Database

This notebook summarizes how to load the MyRWA water quality database into R.

Preface

Before describing how to connect to the database and retrieve data, it's useful to see where we are we going. Ultimately, water quality analysis is best performed when data have the following fields:

Project
Site
Date
Variable
Value
Units
Flag
Qualifier

For creating maps, we also want a dataframe containing the location of each site:

Site
Latitude
Longitude

With this goal in mind, I'll now describe how to connect to the database in R, retrieve the data, and conform it to these two target dataframes.

Connecting to the Database

The RODBC package provides functions for directly connecting to Access databases. Because the MyRWA database has the extension *.accdb, the odbcConnectAccess2007() function must be used instead of odbcConnectAccess(), which only connects to *.mdb files. Also note that RODBC only seems to work in the 32-bit mode of R (otherwise get error saying it can't connect). The connect function returns a channel object that is used to access the database.

library(RODBC)
library(ggplot2)

DB_PATH <- "C:/Users/Jeff/Dropbox/Research/mystic/data/myrwa_2012-08-27.accdb"
ch <- odbcConnectAccess2007(DB_PATH)

There are various functions such as sqlTables() and sqlColumns() for listing the tables and table fields of the database

sqlTables(ch, tableType = "TABLE")$TABLE_NAME

##  [1] "~TMPCLP216931"                "Activity"                    
##  [3] "Activity_Media"               "Activity_Type"               
##  [5] "ActivityImport$_ImportErrors" "Analytical_Method"           
##  [7] "baseline-station-coord"       "Characteristic"              
##  [9] "Characteristic_Nominal"       "Horizontal_Collection_Method"
## [11] "Location"                     "Location_Name"               
## [13] "Locations$_ImportErrors"      "LocationTable$_ImportErrors" 
## [15] "Methods"                      "Methods_Old"                 
## [17] "Monitor"                      "Monitoring_Location_Type"    
## [19] "Municipality"                 "Project"                     
## [21] "Qualifier_Lookup"             "Reference"                   
## [23] "Result"                       "Result_backup_021012"        
## [25] "Result_Flag_Lookup"           "Result_Status"               
## [27] "Results$_ImportErrors"        "sample"                      
## [29] "Sample_Fraction"              "Source"                      
## [31] "SSOs"                         "test"                        
## [33] "Thermometer"                  "Visit"                       
## [35] "Visit_Monitor"                "Water_Type"

sqlColumns(ch, "Result")[, c("COLUMN_NAME", "TYPE_NAME", "REMARKS")]

##             COLUMN_NAME TYPE_NAME         REMARKS
## 1                    ID   COUNTER            <NA>
## 2           activity_id   INTEGER            <NA>
## 3            monitor_id   INTEGER            <NA>
## 4     characteristic_id   INTEGER            <NA>
## 5     result_measurenum    DOUBLE numeric measure
## 6  result_measurenom_id   INTEGER nomeric measure
## 7           result_unit   VARCHAR            <NA>
## 8     result_value_type   VARCHAR            <NA>
## 9           result_flag   VARCHAR            <NA>
## 10        result_method   VARCHAR            <NA>
## 11   result_labsamplenr   INTEGER            <NA>
## 12     result_status_id   INTEGER            <NA>
## 13      result_comments  LONGCHAR            <NA>
## 14       thermometer_id   INTEGER            <NA>
## 15            qualifier   VARCHAR            <NA>
## 16           new_method   INTEGER            <NA>
## 17      lower_detection    DOUBLE            <NA>
## 18      upper_detection    DOUBLE            <NA>
## 19               Field1   VARCHAR            <NA>

Fetching Data

There are three methods for fetching data from the database via R: 1) fetch tables as dataframes and merge, 2) submit SQL statement, 3) fetch existing query.

Method 1: Fetch tables and merge

The tables are fetched individually using the sqlFetch() function which returns each table as a dataframe. The dataframes are then joined within R using the merge() function. Note that only two dataframes can be merged at a time, so with multiple tables this requires multiple calls to merge(). The join keys must be specified individually for each table using the by.x and by.y arguments, and all.x=T is used to prevent loss of results from the dataframes if there are no corresponding records in the other tables.

tblResult <- sqlFetch(ch, "Result")
tblCharacteristic <- sqlFetch(ch, "Characteristic")
tblActivity <- sqlFetch(ch, "Activity")
tblVisit <- sqlFetch(ch, "Visit")
tblLocation <- sqlFetch(ch, "Location")
tblProject <- sqlFetch(ch, "Project")

The resulting dataframes can then be joined using merge().

df.fetch <- merge(tblResult[, c("ID", "activity_id", "characteristic_id", 
    "result_measurenum", "result_measurenom_id", "result_unit", "result_flag", 
    "qualifier")], tblCharacteristic[, c("ID", "characteristic_name", "characteristic_type")], 
    by.x = "characteristic_id", by.y = "ID", all.x = T)
df.fetch <- merge(df.fetch, tblActivity[, c("ID", "visit_id", "activity_type_id", 
    "activity_media_id", "activity_sampledate")], by.x = "activity_id", by.y = "ID", 
    all.x = T)
df.fetch <- merge(df.fetch, tblVisit[, c("ID", "project_ID", "monitoring_location_ID", 
    "location_ID", "visit_date")], by.x = "visit_id", by.y = "ID", all.x = T)
df.fetch <- merge(df.fetch, tblLocation[, c("ID", "location_name_shortname")], 
    by.x = "location_ID", by.y = "ID", all.x = T)
df.fetch <- merge(df.fetch, tblProject[, c("ID", "project_name")], 
    by.x = "project_ID", by.y = "ID", all.x = T)

The resulting dataframe, df.fetch, now has all the desired fields but many additional fields (primary keys) that are no longer needed now that the joins are complete.

names(df.fetch)

##  [1] "project_ID"              "location_ID"            
##  [3] "visit_id"                "activity_id"            
##  [5] "characteristic_id"       "ID"                     
##  [7] "result_measurenum"       "result_measurenom_id"   
##  [9] "result_unit"             "result_flag"            
## [11] "qualifier"               "characteristic_name"    
## [13] "characteristic_type"     "activity_type_id"       
## [15] "activity_media_id"       "activity_sampledate"    
## [17] "monitoring_location_ID"  "visit_date"             
## [19] "location_name_shortname" "project_name"

df.fetch <- within(df.fetch, {
    project <- project_name
    site <- location_name_shortname
    date <- activity_sampledate
    var <- characteristic_name
    value <- result_measurenum
    units <- result_unit
    flag <- result_flag
    qualifier <- qualifier
})
df.fetch <- df.fetch[, c("project", "site", "date", "var", "value", 
    "units", "flag", "qualifier")]
head(df.fetch)

##                              project   site                date
## 1 Baseline Mystic River Watershed MA ABR006 2001-04-11 07:05:00
## 2 Baseline Mystic River Watershed MA ABR006 2006-01-11 07:04:00
## 3 Baseline Mystic River Watershed MA ABR006 2001-04-11 07:05:00
## 4 Baseline Mystic River Watershed MA ABR006 2001-05-22 06:01:00
## 5 Baseline Mystic River Watershed MA ABR006 2007-08-15 07:08:00
## 6 Baseline Mystic River Watershed MA ABR006 2007-08-15 07:08:00
##                     var    value  units flag qualifier
## 1            Phosphorus  0.04212   mg/l <NA>      <NA>
## 2            Water Odor       NA   <NA> <NA>      <NA>
## 3 Dissolved oxygen (DO)  9.35000   mg/l    *      <NA>
## 4            Phosphorus  1.97400 mmol/L    *      <NA>
## 5            Water Odor       NA   <NA> <NA>      <NA>
## 6    Temperature, water 19.50000  deg C <NA>      <NA>

Method 2: SQL Query

The other method is to use a single SQL statement to retrieve a database query using the sqlQuery() function. The easiest method for constructing the SQL statement is by creating a Query in Access using the query design tool, and then copying the resulting SQL into R.

Note that it is important to specify the correct join type in SQL queries. By default, Access will perform 'inner' joins meaning it will only select records that have values in both tables. So for example, if there is a join from Result.result_flag to Result_Flag_Lookup.ID then the results will only include records where there is some result flag. To force Access to include all results, including those with missing flags, the join must be changed to a LEFT join.

SQL <- "SELECT Project.project_name AS project, Location.location_name_shortname AS site, 
Activity.activity_sampledate AS [date], Characteristic.characteristic_name AS var, 
Result.result_measurenum AS val, Result.result_unit AS units, Result_Flag_Lookup.result_flag AS flag, 
Qualifier_Lookup.Qualifier AS qualifier
FROM Project INNER JOIN (Location RIGHT JOIN (Visit INNER JOIN (Activity INNER JOIN 
(((Characteristic RIGHT JOIN Result ON Characteristic.ID = Result.characteristic_id) 
LEFT JOIN Result_Flag_Lookup ON Result.result_flag = Result_Flag_Lookup.result_flag) LEFT JOIN 
Qualifier_Lookup ON Result.qualifier = Qualifier_Lookup.Qualifier) ON Activity.ID = Result.activity_id) 
ON Visit.ID = Activity.visit_id) ON Location.ID = Visit.monitoring_location_ID) ON Project.ID = 
Visit.project_ID;"
df.sql <- sqlQuery(ch, SQL)
names(df.sql)[which(names(df.sql)=='val')] <- 'value'
head(df.sql)

##                              project   site                date
## 1 Baseline Mystic River Watershed MA ABR049 2002-06-12 07:12:00
## 2 Baseline Mystic River Watershed MA ABR028 2002-06-12 06:12:00
## 3 Baseline Mystic River Watershed MA ABR006 2002-06-12 06:53:00
## 4 Baseline Mystic River Watershed MA UPL001 2002-06-12 07:35:00
## 5 Baseline Mystic River Watershed MA MIB001 2002-06-12 06:25:00
## 6 Baseline Mystic River Watershed MA MYR071 2002-06-12 07:08:00
##                var value     units flag qualifier
## 1 Escherichia coli   580 MPN/100ml <NA>      <NA>
## 2 Escherichia coli   530 MPN/100ml    *      <NA>
## 3 Escherichia coli   330 MPN/100ml <NA>      <NA>
## 4 Escherichia coli    80 MPN/100ml <NA>      <NA>
## 5 Escherichia coli  1745 MPN/100ml    *      <NA>
## 6 Escherichia coli   770 MPN/100ml <NA>      <NA>

Method 3: Fetch Query

If the database already has a query set up that creates the desired table, then we can use sqlFetch() to fetch the query results (same as fetching each table individually). The SQL statement from the previous method was saved as a query in the database called 'qryR'.

df.qry <- sqlFetch(ch, "qryR")
names(df.qry)[which(names(df.qry) == "val")] <- "value"
head(df.qry)

##                              project   site                date
## 1 Baseline Mystic River Watershed MA ABR049 2002-06-12 07:12:00
## 2 Baseline Mystic River Watershed MA ABR028 2002-06-12 06:12:00
## 3 Baseline Mystic River Watershed MA ABR006 2002-06-12 06:53:00
## 4 Baseline Mystic River Watershed MA UPL001 2002-06-12 07:35:00
## 5 Baseline Mystic River Watershed MA MIB001 2002-06-12 06:25:00
## 6 Baseline Mystic River Watershed MA MYR071 2002-06-12 07:08:00
##                var value     units flag qualifier
## 1 Escherichia coli   580 MPN/100ml <NA>      <NA>
## 2 Escherichia coli   530 MPN/100ml    *      <NA>
## 3 Escherichia coli   330 MPN/100ml <NA>      <NA>
## 4 Escherichia coli    80 MPN/100ml <NA>      <NA>
## 5 Escherichia coli  1745 MPN/100ml    *      <NA>
## 6 Escherichia coli   770 MPN/100ml <NA>      <NA>

Remarks

Any of these three methods can be used to load the data from the database into R in a dataframe suitable for analysis. I would prefer the third as it allows you to set up the query in Access, and then fetch that directly avoiding the need for SQL or merge statements. However, I am

Note that a major bug in all of these approaches is that the site (location_name_shortname) field is not unique, so sites with the same shortname cannot be distinguished in any of the resulting dataframes. Until this is fixed, one could use the location ID instead of the shortname as the site variable.