Our aim

The main goal of this project was to integrate a Microsoft Access database into R using the RODBC package. While this appeared very straightforward, we ran into several complex issues. Here, we document the problems, solutions we found, and some results we were able to accomplish.

What is an Access database?

Microsoft Access

Access is a database management system using Microsoft’s Jet Database Engine Databases are created and stored in a unique format, making them not readily accessible to everyone

Getting Access data into R Introduction

There are two methods to connect with Access databases 1. Using the ODBC (Open DataBase Connectivity) facility 2. Using the DBI (DataBase Interface) package in R ODBC allows connection to a database to be opened Then, SQL is needed to actually import database tables into R

ODBC

ODBC allows access to a variety of database systems via a common interface ODBC requires a specific driver to be installed (in our case, a Microsoft Access Driver which is already installed with the installation of Office) which serves as a translation layer between ODBC and the database system

Interfacing with a Microsoft Access database through Rstudio

Our project was investigating using the R package ‘RODBC’ to establish a connection to a Microsoft Access Database (.mdb , .accdb) through RStudio, in order to query the database using SQL language within R.

First you have to installl the package

https://CRAN.R-project.org/package=RODBC

install.packages("RODBC")

The documentation for establishing the database connection with the RODBC package is as follows

library("RODBC") #load package
db<-file.path("C:/path/to/your/database.accdb") #connect database.
channel<-odbcConnectAccess2007(db) #internal RODBC function
dataSetName<-sqlFetch(channel,"TableName") #read particular table from Access database file.

Another option included specifying the Microsoft Driver

## Set up driver info and database path
DRIVERINFO <- "Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
MDBPATH <- "C:/path/to/your/database.accdb"
PATH <- paste0(DRIVERINFO, "DBQ=", MDBPATH)

## Establish connection
channel <- odbcDriverConnect(PATH)

For this to work, you must have a ODBC Microsoft Access Driver installed on your computer

If you have Microsoft Access installed, the driver should already be installed. However, you can also install the drivers (32-bit or 64-bit) separately: https://www.microsoft.com/en-us/download/details.aspx?id=54920

There are two ways that information about the driver and the location of the database system is provided to ODBC

The first way is to provide all information from within R each time the connection to the database system is established. To do this, you will need the file path of the database (EX. C:.accdb), and the driver information, as in the two examples above.

Establishing a DSN

Alternatively, you can establish a DSN (Data Source Name) so that the DNS describes the connection and only the DSN needs to be provided each time a connection is made. This is a workaround for establishing the Driver Connection on Windows, that seems to work ~ 50% of the time

Source: https://www.r-bloggers.com/2013/01/getting-access-data-into-r/

Step 1: Go to Control Panel

Step 2: Search ‘ODBC data sources’. Click on Set up ODBC data sources (either 32-bit or 64-bit depending on your version of Microsoft. In our case, we are using 64-bit)

Step 3: Under ‘User DSN’, you will see a list of the User Data Sources. Click ‘Add…’

Step 4: Select Microsoft Access Driver(.mdb, .accdb). Then click ‘Finish’

Step 5: Name your Data Source Name (DSN) and add a description, then click ‘OK’

Step 6: Now, under ‘User DSN”’you should see your newly named DSN. Click ’OK’ again to close the window.

Now you can access and import Access Database Tables into R using the DSN name you assigned in the previous steps

#load package
library(RODBC)
#establish connection to DSN that you created manually
#we named our DSN 'ERMNSQL'
con <- odbcConnect("ERMNSQL") #only the name is needed, not the full path

View all the tables in the database

sqlTables(con, tableType = "TABLE")$TABLE_NAME
##  [1] "tbl_Discharge"               "tbl_Discharge_alternate"    
##  [3] "tbl_Events"                  "tbl_Fish"                   
##  [5] "tbl_Fish_Count"              "tbl_Habitat"                
##  [7] "tbl_Habitat_LWD"             "tbl_Habitat_LWD_Transect"   
##  [9] "tbl_Habitat_Transect"        "tbl_Habitat_Transect_Fields"
## [11] "tbl_Sites"                   "tlu_Condition_Category_10"  
## [13] "tlu_Condition_Category_20"   "tlu_Contacts"               
## [15] "tlu_Cover_Percent"           "tlu_Diamater"               
## [17] "tlu_Riparian_Vegetation"     "tlu_Species"                
## [19] "tlu_Species_BMI"             "tlu_Substrate_Size_Class"   
## [21] "tlu_Water_Clarity"           "tlu_Water_Color"            
## [23] "tlu_Were_Fish_Captured"

View column names in a table

sqlColumns(con, "tbl_Events")$COLUMN_NAME
##  [1] "Event_ID"                      "Site_ID"                      
##  [3] "Start_Date"                    "Visit_Number"                 
##  [5] "Observer_1"                    "Observer_2"                   
##  [7] "Observer_3"                    "WQ_Time"                      
##  [9] "Temp_C"                        "SpecficConductance"           
## [11] "DO_mgL"                        "DO_Percent"                   
## [13] "pH"                            "WaterColor"                   
## [15] "Secchi_Depth_cm"               "Discharge_Notes"              
## [17] "Event_Notes"                   "Event_Metadata"               
## [19] "DischargeQ_Calculated_m3_sec"  "DischargeQ_Calculated_ft3_sec"
## [21] "Verified_Date"                 "Verified_By"                  
## [23] "Certified_Date"                "Certified_By"                 
## [25] "SampleID"
sqlColumns(con, "tbl_Sites")$COLUMN_NAME
##  [1] "Site_ID"       "Type"          "GRTS_Num"      "Status"       
##  [5] "LAT_Y"         "LON_X"         "Site_Name"     "Panel"        
##  [9] "Unit_Code"     "State"         "Reach_Width"   "Reach_Length" 
## [13] "Site_Metadata" "IsSiteActive"
sqlColumns(con, "tbl_Fish_Count")$COLUMN_NAME
## [1] "Fish_Count_ID"          "Fish_ID"                "Scientific_Name"       
## [4] "Adult_Count"            "YoY_Count"              "Total_Count_Calculated"
## [7] "Photo_File_Notes"

View specific tables

sqlFetch(con, "tbl_Events", max = 10)
##                                Event_ID   Site_ID Start_Date Visit_Number
## 1  804B8A3B-45C5-422B-A089-CCC82F48A178 BLUE.1001 2022-05-18            1
## 2  2C6ABDD1-339C-426F-B7C3-E1A3E2BD1086 BLUE.1002 2022-06-10            1
## 3  171826A0-006A-4215-B3D0-CB5FCFE75DE7 BLUE.1003 2022-06-09            1
## 4  608845CD-366A-4BC6-88D9-E8F76F7C72AA DEWA.3001 2022-06-16            1
## 5  E8597FCC-87F1-40EB-A76F-129D4853F069 DEWA.3001 2022-06-24            2
## 6  4D1303D7-A5E0-4595-B37B-48D12617CD9D DEWA.3001 2022-06-27            3
## 7  E4085646-B39D-45DF-A1F9-50B5882F5AEC DEWA.3002 2022-06-17            1
## 8  1AB58A1C-E839-4F2F-BC91-6CB7A56112CE DEWA.3002 2022-06-24            2
## 9  8B8A2B41-B2E5-4047-B513-9FD41AEDF55C DEWA.3002 2022-06-27            3
## 10 A4CC461E-8690-4E01-8BB0-385790A87999 DEWA.3003 2022-06-20            1
##    Observer_1 Observer_2 Observer_3             WQ_Time Temp_C
## 1        Stum    Haglund     Nulton 1899-12-30 10:03:00  12.53
## 2        Stum    Haglund     Nulton 1899-12-30 09:48:00  15.14
## 3        Stum    Haglund     Nulton 1899-12-30 10:07:00  18.12
## 4        Stum     Nulton    Haglund 1899-12-30 11:09:00  14.03
## 5        Stum    Haglund     Nulton 1899-12-30 09:57:00  13.10
## 6        Stum    Haglund     Nulton 1899-12-30 12:59:00  15.79
## 7        Stum    Haglund     Nulton 1899-12-30 08:13:00  16.77
## 8        Stum    Haglund     Nulton 1899-12-30 12:48:00  15.86
## 9        Stum    Haglund     Nulton 1899-12-30 15:32:00  18.36
## 10       Stum    Haglund     Nulton 1899-12-30 09:40:00  12.54
##    SpecficConductance DO_mgL DO_Percent   pH WaterColor Secchi_Depth_cm
## 1               102.4  10.29       97.9 7.57      Clear              44
## 2               172.6   9.69       97.6 7.59      Clear             120
## 3               105.4   8.94       95.9 7.64      Clear             120
## 4                32.8  10.32      101.5 6.92      Brown              97
## 5                39.0  10.31       99.3 6.97      Clear             120
## 6                40.5   9.54       97.5 6.63      Clear             120
## 7               223.9   9.56       99.8 7.23      Clear             120
## 8               236.7   9.93      101.6 7.44      Clear             120
## 9               242.8   9.20       99.2 7.12      Clear             120
## 10              178.9  10.19       96.9 6.80      Clear             120
##                                                                          Discharge_Notes
## 1                                                                                   <NA>
## 2                                        taken in pool ~ 10m. Upstream of reach endpoint
## 3                                                                                   <NA>
## 4  Taken ~ 15m upstream of marked midpoint in tailout of deep pool between two LWD piles
## 5                                                                            taken at T5
## 6                                                                                   <NA>
## 7                    Taken ~ 5m upstream of midpoint before cutbank pool on River Right.
## 8                                                                                   <NA>
## 9                                                                                   <NA>
## 10                                        Taken ~5m above reach start @ large beech tree
##                                            Event_Notes Event_Metadata
## 1                                                 <NA>             NA
## 2  Additional observers: Jennifer Flippin & Haley King             NA
## 3  Additional observers: Jennifer Flippin & Haley King             NA
## 4                                                 <NA>             NA
## 5                                                 <NA>             NA
## 6                                                 <NA>             NA
## 7                                                 <NA>             NA
## 8                                                 <NA>             NA
## 9                                                 <NA>             NA
## 10                                                <NA>             NA
##    DischargeQ_Calculated_m3_sec DischargeQ_Calculated_ft3_sec Verified_Date
## 1                             0                             0    2022-09-26
## 2                             0                             0    2022-09-26
## 3                             0                             0    2022-09-26
## 4                             0                             0    2022-10-12
## 5                             0                             0    2022-10-12
## 6                             0                             0    2022-10-12
## 7                             0                             0    2022-10-12
## 8                             0                             0    2022-10-12
## 9                             0                             0    2022-10-12
## 10                            0                             0    2022-10-12
##    Verified_By Certified_Date Certified_By SampleID
## 1       Thomas     2022-10-28         Stum        8
## 2       Thomas     2022-10-28         Stum        9
## 3       Thomas     2022-10-28         Stum       10
## 4       Thomas           <NA>         <NA>       45
## 5       Thomas           <NA>         <NA>       46
## 6       Thomas           <NA>         <NA>       47
## 7       Thomas           <NA>         <NA>       48
## 8       Thomas           <NA>         <NA>       49
## 9       Thomas           <NA>         <NA>       50
## 10      Thomas           <NA>         <NA>       51

Building queries to select specific columns from a table

WQqry <-  "SELECT Event_ID, Site_ID, Start_Date, Visit_Number, WQ_Time, Temp_C,
        SpecficConductance, DO_mgl, DO_Percent, pH, WaterColor, Secchi_Depth_cm 
        FROM  tbl_Events" #embed the SQL query language in an object

#use the RODBC command 'sqlQuery' to pass your query object to the connected database
SiteWQ <- sqlQuery(con, WQqry)

head(SiteWQ)
##                               Event_ID   Site_ID Start_Date Visit_Number
## 1 804B8A3B-45C5-422B-A089-CCC82F48A178 BLUE.1001 2022-05-18            1
## 2 2C6ABDD1-339C-426F-B7C3-E1A3E2BD1086 BLUE.1002 2022-06-10            1
## 3 171826A0-006A-4215-B3D0-CB5FCFE75DE7 BLUE.1003 2022-06-09            1
## 4 608845CD-366A-4BC6-88D9-E8F76F7C72AA DEWA.3001 2022-06-16            1
## 5 E8597FCC-87F1-40EB-A76F-129D4853F069 DEWA.3001 2022-06-24            2
## 6 4D1303D7-A5E0-4595-B37B-48D12617CD9D DEWA.3001 2022-06-27            3
##               WQ_Time Temp_C SpecficConductance DO_mgl DO_Percent   pH
## 1 1899-12-30 10:03:00  12.53              102.4  10.29       97.9 7.57
## 2 1899-12-30 09:48:00  15.14              172.6   9.69       97.6 7.59
## 3 1899-12-30 10:07:00  18.12              105.4   8.94       95.9 7.64
## 4 1899-12-30 11:09:00  14.03               32.8  10.32      101.5 6.92
## 5 1899-12-30 09:57:00  13.10               39.0  10.31       99.3 6.97
## 6 1899-12-30 12:59:00  15.79               40.5   9.54       97.5 6.63
##   WaterColor Secchi_Depth_cm
## 1      Clear              44
## 2      Clear             120
## 3      Clear             120
## 4      Brown              97
## 5      Clear             120
## 6      Clear             120
#another query from a different table
nameqry <- "SELECT Site_ID, Site_Name From tbl_Sites"
SiteNames <- sqlQuery(con, nameqry)

head(SiteNames)
##     Site_ID              Site_Name
## 1 ALPO.1001          Millstone Run
## 2 ALPO.2001          Blair Gap Run
## 3 BLUE.1001          Indian Branch
## 4 BLUE.1002         Mountain Creek
## 5 BLUE.1003 Little Bluestone River
## 6 BLUE.1004         Jarrell Branch

GROUP BY and AVG arguments

pHqry <- "SELECT Site_ID, avg(pH) AS Average_pH, max(pH) AS Max_pH, min(pH) AS Min_pH  
      FROM tbl_EVENTS
      GROUP BY Site_ID"
pHWQ <- sqlQuery(con, pHqry)
head(pHWQ)
##     Site_ID Average_pH Max_pH Min_pH
## 1 BLUE.1001   7.570000   7.57   7.57
## 2 BLUE.1002   7.590000   7.59   7.59
## 3 BLUE.1003   7.640000   7.64   7.64
## 4 DEWA.3001   6.840000   6.97   6.63
## 5 DEWA.3002   7.263333   7.44   7.12
## 6 DEWA.3003   6.830000   7.04   6.65

Unable to get JOINS to work, as syntax for sql queries is different from the DBI package

However, you can use base R ‘merge’ as a workaround

#merging the two dataframes created from the first two queries above
SiteWQ <- merge(SiteNames, SiteWQ, by="Site_ID")
head(SiteWQ)
##     Site_ID              Site_Name                             Event_ID
## 1 BLUE.1001          Indian Branch 804B8A3B-45C5-422B-A089-CCC82F48A178
## 2 BLUE.1002         Mountain Creek 2C6ABDD1-339C-426F-B7C3-E1A3E2BD1086
## 3 BLUE.1003 Little Bluestone River 171826A0-006A-4215-B3D0-CB5FCFE75DE7
## 4 DEWA.3001     Caledonia Creek 13 608845CD-366A-4BC6-88D9-E8F76F7C72AA
## 5 DEWA.3001     Caledonia Creek 13 E8597FCC-87F1-40EB-A76F-129D4853F069
## 6 DEWA.3001     Caledonia Creek 13 4D1303D7-A5E0-4595-B37B-48D12617CD9D
##   Start_Date Visit_Number             WQ_Time Temp_C SpecficConductance DO_mgl
## 1 2022-05-18            1 1899-12-30 10:03:00  12.53              102.4  10.29
## 2 2022-06-10            1 1899-12-30 09:48:00  15.14              172.6   9.69
## 3 2022-06-09            1 1899-12-30 10:07:00  18.12              105.4   8.94
## 4 2022-06-16            1 1899-12-30 11:09:00  14.03               32.8  10.32
## 5 2022-06-24            2 1899-12-30 09:57:00  13.10               39.0  10.31
## 6 2022-06-27            3 1899-12-30 12:59:00  15.79               40.5   9.54
##   DO_Percent   pH WaterColor Secchi_Depth_cm
## 1       97.9 7.57      Clear              44
## 2       97.6 7.59      Clear             120
## 3       95.9 7.64      Clear             120
## 4      101.5 6.92      Brown              97
## 5       99.3 6.97      Clear             120
## 6       97.5 6.63      Clear             120

Lessons Learned

SQL Project Roadblocks

Microsoft Access isn’t included in MS Office for macOS This means although you don’t need to have the Microsoft Access file opened to use in R, you won’t have the Microsoft Access Driver needed. You could use a virtual machine on your Mac to run window applications or dual boot Windows and macOS on your machine…

Driver installation

Instead of being able to connect to the Microsoft Access driver through a line of code as per the package documentation, we had to manually create a driver as outlined previously. Even then, only 2 of us were able to get R to connect to the DSN, as the computer is still unable to find the DSN or there is a mismatch between the Driver, the Database, and R.

Compatibility Issues

Using odbcConnectAccess() will sometimes give an error message indicating that odbcConnectAccess is only usable with 32-bit Windows. This can sometimes occur if you have 32 bit office installed on your machine and using 64-bit R… or for other unknown causes in our situation… (we got this error even though we had 64-bit R and Microsoft).

One way to potentially fix this is if you have R that has both 32-bit and 64-bit versions so you can switch to 32-bit, or download an older R version with 32-bit. The downside to switching to 32-bit is that you may come across issues such as memory issues. Then, make sure you have the 32-bit version of the Driver and proceed to establish the connection the same way as steps previously mentioned. Although, we tried this workaround and were still unable to get it to work.

Syntax differences between ‘DBI’ and ‘RODBC’

If you are familiar with arguments for SQL queries with the ‘DBI’ package, the arguments and syntax look a little different under the ‘RODBC’ package. This made it a little more challenging than we initially thought to just transfer the queries we learned to do in class.