This notebook summarizes how to load the MyRWA water quality database into R.
Before describing how to connect to the database and retrieve data, it's useful to see where we are we going. Ultimately, water quality analysis is best performed when data have the following fields:
For creating maps, we also want a dataframe containing the location of each site:
With this goal in mind, I'll now describe how to connect to the database in R, retrieve the data, and conform it to these two target dataframes.
The RODBC package provides functions for directly connecting to Access databases. Because the MyRWA database has the extension *.accdb, the odbcConnectAccess2007() function must be used instead of odbcConnectAccess(), which only connects to *.mdb files. Also note that RODBC only seems to work in the 32-bit mode of R (otherwise get error saying it can't connect). The connect function returns a channel object that is used to access the database.
library(RODBC)
library(ggplot2)
DB_PATH <- "C:/Users/Jeff/Dropbox/Research/mystic/data/myrwa_2012-08-27.accdb"
ch <- odbcConnectAccess2007(DB_PATH)
There are various functions such as sqlTables() and sqlColumns() for listing the tables and table fields of the database
sqlTables(ch, tableType = "TABLE")$TABLE_NAME
## [1] "~TMPCLP216931" "Activity"
## [3] "Activity_Media" "Activity_Type"
## [5] "ActivityImport$_ImportErrors" "Analytical_Method"
## [7] "baseline-station-coord" "Characteristic"
## [9] "Characteristic_Nominal" "Horizontal_Collection_Method"
## [11] "Location" "Location_Name"
## [13] "Locations$_ImportErrors" "LocationTable$_ImportErrors"
## [15] "Methods" "Methods_Old"
## [17] "Monitor" "Monitoring_Location_Type"
## [19] "Municipality" "Project"
## [21] "Qualifier_Lookup" "Reference"
## [23] "Result" "Result_backup_021012"
## [25] "Result_Flag_Lookup" "Result_Status"
## [27] "Results$_ImportErrors" "sample"
## [29] "Sample_Fraction" "Source"
## [31] "SSOs" "test"
## [33] "Thermometer" "Visit"
## [35] "Visit_Monitor" "Water_Type"
sqlColumns(ch, "Result")[, c("COLUMN_NAME", "TYPE_NAME", "REMARKS")]
## COLUMN_NAME TYPE_NAME REMARKS
## 1 ID COUNTER <NA>
## 2 activity_id INTEGER <NA>
## 3 monitor_id INTEGER <NA>
## 4 characteristic_id INTEGER <NA>
## 5 result_measurenum DOUBLE numeric measure
## 6 result_measurenom_id INTEGER nomeric measure
## 7 result_unit VARCHAR <NA>
## 8 result_value_type VARCHAR <NA>
## 9 result_flag VARCHAR <NA>
## 10 result_method VARCHAR <NA>
## 11 result_labsamplenr INTEGER <NA>
## 12 result_status_id INTEGER <NA>
## 13 result_comments LONGCHAR <NA>
## 14 thermometer_id INTEGER <NA>
## 15 qualifier VARCHAR <NA>
## 16 new_method INTEGER <NA>
## 17 lower_detection DOUBLE <NA>
## 18 upper_detection DOUBLE <NA>
## 19 Field1 VARCHAR <NA>
There are three methods for fetching data from the database via R: 1) fetch tables as dataframes and merge, 2) submit SQL statement, 3) fetch existing query.
The tables are fetched individually using the sqlFetch() function which returns each table as a dataframe. The dataframes are then joined within R using the merge() function. Note that only two dataframes can be merged at a time, so with multiple tables this requires multiple calls to merge(). The join keys must be specified individually for each table using the by.x and by.y arguments, and all.x=T is used to prevent loss of results from the dataframes if there are no corresponding records in the other tables.
tblResult <- sqlFetch(ch, "Result")
tblCharacteristic <- sqlFetch(ch, "Characteristic")
tblActivity <- sqlFetch(ch, "Activity")
tblVisit <- sqlFetch(ch, "Visit")
tblLocation <- sqlFetch(ch, "Location")
tblProject <- sqlFetch(ch, "Project")
The resulting dataframes can then be joined using merge().
df.fetch <- merge(tblResult[, c("ID", "activity_id", "characteristic_id",
"result_measurenum", "result_measurenom_id", "result_unit", "result_flag",
"qualifier")], tblCharacteristic[, c("ID", "characteristic_name", "characteristic_type")],
by.x = "characteristic_id", by.y = "ID", all.x = T)
df.fetch <- merge(df.fetch, tblActivity[, c("ID", "visit_id", "activity_type_id",
"activity_media_id", "activity_sampledate")], by.x = "activity_id", by.y = "ID",
all.x = T)
df.fetch <- merge(df.fetch, tblVisit[, c("ID", "project_ID", "monitoring_location_ID",
"location_ID", "visit_date")], by.x = "visit_id", by.y = "ID", all.x = T)
df.fetch <- merge(df.fetch, tblLocation[, c("ID", "location_name_shortname")],
by.x = "location_ID", by.y = "ID", all.x = T)
df.fetch <- merge(df.fetch, tblProject[, c("ID", "project_name")],
by.x = "project_ID", by.y = "ID", all.x = T)
The resulting dataframe, df.fetch, now has all the desired fields but many additional fields (primary keys) that are no longer needed now that the joins are complete.
names(df.fetch)
## [1] "project_ID" "location_ID"
## [3] "visit_id" "activity_id"
## [5] "characteristic_id" "ID"
## [7] "result_measurenum" "result_measurenom_id"
## [9] "result_unit" "result_flag"
## [11] "qualifier" "characteristic_name"
## [13] "characteristic_type" "activity_type_id"
## [15] "activity_media_id" "activity_sampledate"
## [17] "monitoring_location_ID" "visit_date"
## [19] "location_name_shortname" "project_name"
df.fetch <- within(df.fetch, {
project <- project_name
site <- location_name_shortname
date <- activity_sampledate
var <- characteristic_name
value <- result_measurenum
units <- result_unit
flag <- result_flag
qualifier <- qualifier
})
df.fetch <- df.fetch[, c("project", "site", "date", "var", "value",
"units", "flag", "qualifier")]
head(df.fetch)
## project site date
## 1 Baseline Mystic River Watershed MA ABR006 2001-04-11 07:05:00
## 2 Baseline Mystic River Watershed MA ABR006 2006-01-11 07:04:00
## 3 Baseline Mystic River Watershed MA ABR006 2001-04-11 07:05:00
## 4 Baseline Mystic River Watershed MA ABR006 2001-05-22 06:01:00
## 5 Baseline Mystic River Watershed MA ABR006 2007-08-15 07:08:00
## 6 Baseline Mystic River Watershed MA ABR006 2007-08-15 07:08:00
## var value units flag qualifier
## 1 Phosphorus 0.04212 mg/l <NA> <NA>
## 2 Water Odor NA <NA> <NA> <NA>
## 3 Dissolved oxygen (DO) 9.35000 mg/l * <NA>
## 4 Phosphorus 1.97400 mmol/L * <NA>
## 5 Water Odor NA <NA> <NA> <NA>
## 6 Temperature, water 19.50000 deg C <NA> <NA>
The other method is to use a single SQL statement to retrieve a database query using the sqlQuery() function. The easiest method for constructing the SQL statement is by creating a Query in Access using the query design tool, and then copying the resulting SQL into R.
Note that it is important to specify the correct join type in SQL queries. By default, Access will perform 'inner' joins meaning it will only select records that have values in both tables. So for example, if there is a join from Result.result_flag to Result_Flag_Lookup.ID then the results will only include records where there is some result flag. To force Access to include all results, including those with missing flags, the join must be changed to a LEFT join.
SQL <- "SELECT Project.project_name AS project, Location.location_name_shortname AS site,
Activity.activity_sampledate AS [date], Characteristic.characteristic_name AS var,
Result.result_measurenum AS val, Result.result_unit AS units, Result_Flag_Lookup.result_flag AS flag,
Qualifier_Lookup.Qualifier AS qualifier
FROM Project INNER JOIN (Location RIGHT JOIN (Visit INNER JOIN (Activity INNER JOIN
(((Characteristic RIGHT JOIN Result ON Characteristic.ID = Result.characteristic_id)
LEFT JOIN Result_Flag_Lookup ON Result.result_flag = Result_Flag_Lookup.result_flag) LEFT JOIN
Qualifier_Lookup ON Result.qualifier = Qualifier_Lookup.Qualifier) ON Activity.ID = Result.activity_id)
ON Visit.ID = Activity.visit_id) ON Location.ID = Visit.monitoring_location_ID) ON Project.ID =
Visit.project_ID;"
df.sql <- sqlQuery(ch, SQL)
names(df.sql)[which(names(df.sql)=='val')] <- 'value'
head(df.sql)
## project site date
## 1 Baseline Mystic River Watershed MA ABR049 2002-06-12 07:12:00
## 2 Baseline Mystic River Watershed MA ABR028 2002-06-12 06:12:00
## 3 Baseline Mystic River Watershed MA ABR006 2002-06-12 06:53:00
## 4 Baseline Mystic River Watershed MA UPL001 2002-06-12 07:35:00
## 5 Baseline Mystic River Watershed MA MIB001 2002-06-12 06:25:00
## 6 Baseline Mystic River Watershed MA MYR071 2002-06-12 07:08:00
## var value units flag qualifier
## 1 Escherichia coli 580 MPN/100ml <NA> <NA>
## 2 Escherichia coli 530 MPN/100ml * <NA>
## 3 Escherichia coli 330 MPN/100ml <NA> <NA>
## 4 Escherichia coli 80 MPN/100ml <NA> <NA>
## 5 Escherichia coli 1745 MPN/100ml * <NA>
## 6 Escherichia coli 770 MPN/100ml <NA> <NA>
If the database already has a query set up that creates the desired table, then we can use sqlFetch() to fetch the query results (same as fetching each table individually). The SQL statement from the previous method was saved as a query in the database called 'qryR'.
df.qry <- sqlFetch(ch, "qryR")
names(df.qry)[which(names(df.qry) == "val")] <- "value"
head(df.qry)
## project site date
## 1 Baseline Mystic River Watershed MA ABR049 2002-06-12 07:12:00
## 2 Baseline Mystic River Watershed MA ABR028 2002-06-12 06:12:00
## 3 Baseline Mystic River Watershed MA ABR006 2002-06-12 06:53:00
## 4 Baseline Mystic River Watershed MA UPL001 2002-06-12 07:35:00
## 5 Baseline Mystic River Watershed MA MIB001 2002-06-12 06:25:00
## 6 Baseline Mystic River Watershed MA MYR071 2002-06-12 07:08:00
## var value units flag qualifier
## 1 Escherichia coli 580 MPN/100ml <NA> <NA>
## 2 Escherichia coli 530 MPN/100ml * <NA>
## 3 Escherichia coli 330 MPN/100ml <NA> <NA>
## 4 Escherichia coli 80 MPN/100ml <NA> <NA>
## 5 Escherichia coli 1745 MPN/100ml * <NA>
## 6 Escherichia coli 770 MPN/100ml <NA> <NA>
Any of these three methods can be used to load the data from the database into R in a dataframe suitable for analysis. I would prefer the third as it allows you to set up the query in Access, and then fetch that directly avoiding the need for SQL or merge statements. However, I am
Note that a major bug in all of these approaches is that the site (location_name_shortname) field is not unique, so sites with the same shortname cannot be distinguished in any of the resulting dataframes. Until this is fixed, one could use the location ID instead of the shortname as the site variable.