Instructors: Nathan Byers and Kali Frost
Students: Please introduce yourself by telling us your name, where you work, and your experience with R
These tutorials are made available online so that you can go back and look through them when you return to work.
The exercises are designed to help you repeat the commands that are in the tutorials with just a few small changes.
Some tutorials will have material that is slightly more advanced and will be left to the student to read independently. Exercises will also have more difficult questions that will be clearly labeled as “Advanced”.
After each tutorial there will be an open lab during which students are encouraged to work on the exercises. There will also be time for discussion if there are questions about the material or the exercises.
Please feel free to speak up and help your class-mates!
Below is a table outlining the activities for each subject we will cover.
Activity | Description | Minutes |
---|---|---|
Tutorial | Lecture | 45 |
Lab | Exercises or review | 30 |
Break | Free time | 15 |
Questions | Any topic | 15 |
These tutorials and exercises are meant to introduce you to general topics and specific R functions.
We do not expect you to master all of the material in two days–there’s too much to learn!
You should expect to understand the tasks that R can accomplish and use these tutorials as resources when you go back to work.
In this session we will cover the following topics:
raqdm
Eric Bailey has written an R package that makes it very easy to get AQS data into R. The package is located on GitHub. It utilizes U.S. EPA’s AirData API. To use EPA’s AirData you must register for an account. Once you register, you will use the username and password that the EPA provides to get data into R using raqdm
.
NOTE: The code below will not work, because you need to provide your own username and password. Also, it wouldn’t be a good idea for the entire class to download large data sets all at the same time. The code below is for illustration, but we encourage you to use raqdm
in your work flow.
First, we need to install the raqdm
package from GitHub, using devtools
.
library(devtools)
install_github("FluentData/raqdm")
library(raqdm)
Once the package is installed and we load it using the library()
function, we use a function called setAQDMuser()
which will set the username and password for us. If you set save = TRUE
then your username and password will be saved for future R sessions.
setAQDMuser(user = "me@email.com", password = "secret", save = TRUE)
Once your user information has been saved, you can use the getAQDMdata()
function to make a request for a specific data set. Right now, synchronous pulls are not working in the AirData API, so you must set synchronous = FALSE
.
benz_req <- getAQDMdata(state = "18", county = "089", bdate = "20150101",
edate = "20151231", param = "45201", synchronous = FALSE)
The code above does not return a data set. A request has been sent to EPA to pull that data from AQS, and an email is sent to you when the data is ready. The request is for benzene values from Lake County, Indiana from January 1 to December 31, 2015.
Once that request has been processed and you receive an email notification, (usually no more than 1 or 2 minutes) you can run the getAQDMrequest()
function to get the data in R.
benz <- getAQDMrequest(benz_req)
head(benz)
## Latitude Longitude Datum Horizontal.Accuracy State.Code County.Code
## 1 41.60668 -87.30473 WGS84 30.36 18 89
## 2 41.60668 -87.30473 WGS84 30.36 18 89
## 3 41.60668 -87.30473 WGS84 30.36 18 89
## 4 41.60668 -87.30473 WGS84 30.36 18 89
## 5 41.60668 -87.30473 WGS84 30.36 18 89
## 6 41.60668 -87.30473 WGS84 30.36 18 89
For more information about parameters that can be used in getAQDMdata()
see the documentation on AQDM Variables
:
?`AQDM Variables`
In a previous training we covered how to import a .csv file into R using the RStudio point-and-click option. We also covered how to use the read.csv()
function to read in a .csv file. To use read.csv()
you need to supply the file path as a quoted text in the first parameter. So if you had a file called “my_data.csv” that was located on your “C:/” drive, you would use the function like this:
my_data <- read.csv("C:/my_data.csv", stringsAsFactors = FALSE)
We set stringsAsFactors = FALSE
to make sure that our data that should be a character data type remains a character data type.
It is possible to read a website like a .csv file on your computer. Reading data from a website is sometimes called web-scraping. A good example of a url that can be easily scraped using R is https://aqs.epa.gov/aqsweb/codes/data/ParametersByDesc.csv. You can find the page by going to the EPA site for AQS codes (http://www.epa.gov/aqs/aqs-code-list) then clicking
Parameters and Download Delimited Version of the Code Table.
You can actually see that the url has a .csv at the end of it. All we need to do to scrape that .csv file is to put the url as the file path in the read.csv()
function. However, as you can see on the website, there is an initial line that isn’t part of the table. We’ll need to skip that line, so we use the skip
parameter to tell the function that we want to skip the first line.
aqs_params <- read.csv("https://aqs.epa.gov/aqsweb/codes/data/ParametersByDesc.csv",
stringsAsFactors = FALSE, skip = 1)
head(aqs_params)
## Parameter.Code Standard.Unit Parameter..Alt.Desc. CAS.Num
## 1 43834 078 HFC-125 354-33-6
## 2 43837 078 None 630-20-6
## 3 43162 003 None 50-29-3
## 4 43818 078 Ethane,1,1,2,2-tetrachloro- 79-34-5
## 5 43821 078 None 76-13-1
## 6 43820 078 Ethane,1,1,2-trichloro- 79-00-5
## Parameter.Desc Conversion.Ind
## 1 1,1,1,2,2-PENTAFLUOROETHANE R
## 2 1,1,1,2-TETRACHLOROETHANE R
## 3 1,1,1-TRICHLORO-2,2-BIS (P-CHLOROPHENYL) ETHANE R
## 4 1,1,2,2-TETRACHLOROETHANE R
## 5 1,1,2-TRICHLORO-1,2,2-TRIFLUOROETHANE R
## 6 1,1,2-TRICHLOROETHANE R
This was a simple file to read from the web, but even data from more complicated websites can be scraped. Check out Hadley Wickham’s rvest
package if you would like to do more scraping: https://github.com/hadley/rvest. For advanced material on processing XML and JSON formats in web services, check out the book XML and Web Technologies for Data Sciences with R.
With R you can make connections to just about any kind of database. Below is a list of databases and the R packages that you can use to make connections and pull data.
Database | R package |
---|---|
Microsoft Access | RODBC |
Microsoft SQL Server | RSQLServer |
Oracle | ROracle |
MySQL and MariaDB | RMySQL |
PostgreSQL | RPostgreSQL |
SQLite | RSQLite |
We will illustrate how to pull a table out of a database by creating a simple SQLite database.
SQLite is an open source database that is very light and portable. Creating a database is very easy. We will use the Parameters table we downloaded from the AQS Codes site to make a database with just one table in it.
Here we load the RSQLite
package and create the database in our working directory.
library(RSQLite)
db <- dbConnect(drv = SQLite(), dbname = "Test.sqlite")
You make a connection to an SQLite database by providing the file path in the dbname
parameter. If the database doesn’t exist already then a new SQLite database will be created.
Now we will load the parameters table into the database.
dbWriteTable(conn = db, name = "Parameters", value = aqs_params)
## [1] TRUE
Now we can check to see what tables are in the database.
dbListTables(db)
## [1] "Parameters"
And we can see what fields are in the table.
dbListFields(db, "Parameters")
## [1] "Parameter.Code" "Standard.Unit" "Parameter..Alt.Desc."
## [4] "CAS.Num" "Parameter.Desc" "Conversion.Ind"
And finally, we can pull out the table and see that it’s the same as the data frame that we put into it.
aqs_params_db <- dbReadTable(db, "Parameters")
head(aqs_params_db)
head(aqs_params)
## Parameter.Code Standard.Unit Parameter..Alt.Desc. CAS.Num
## 1 43834 078 HFC-125 354-33-6
## 2 43837 078 None 630-20-6
## 3 43162 003 None 50-29-3
## 4 43818 078 Ethane,1,1,2,2-tetrachloro- 79-34-5
## 5 43821 078 None 76-13-1
## 6 43820 078 Ethane,1,1,2-trichloro- 79-00-5
## Parameter.Desc Conversion.Ind
## 1 1,1,1,2,2-PENTAFLUOROETHANE R
## 2 1,1,1,2-TETRACHLOROETHANE R
## 3 1,1,1-TRICHLORO-2,2-BIS (P-CHLOROPHENYL) ETHANE R
## 4 1,1,2,2-TETRACHLOROETHANE R
## 5 1,1,2-TRICHLORO-1,2,2-TRIFLUOROETHANE R
## 6 1,1,2-TRICHLOROETHANE R
## Parameter.Code Standard.Unit Parameter..Alt.Desc. CAS.Num
## 1 43834 078 HFC-125 354-33-6
## 2 43837 078 None 630-20-6
## 3 43162 003 None 50-29-3
## 4 43818 078 Ethane,1,1,2,2-tetrachloro- 79-34-5
## 5 43821 078 None 76-13-1
## 6 43820 078 Ethane,1,1,2-trichloro- 79-00-5
## Parameter.Desc Conversion.Ind
## 1 1,1,1,2,2-PENTAFLUOROETHANE R
## 2 1,1,1,2-TETRACHLOROETHANE R
## 3 1,1,1-TRICHLORO-2,2-BIS (P-CHLOROPHENYL) ETHANE R
## 4 1,1,2,2-TETRACHLOROETHANE R
## 5 1,1,2-TRICHLORO-1,2,2-TRIFLUOROETHANE R
## 6 1,1,2-TRICHLOROETHANE R
Now we’ll disconnect from the database and delete it from the working directory.
dbDisconnect(db)
## [1] TRUE
unlink("Test.sqlite")
Exercises for this tutorial can be found here: http://rpubs.com/NateByers/GettingDataExercises.
Sometimes you may want to automate the download of a large file that’s zipped and read it into R. If we look at the National Emissions Inventory page for 2011 we can see that there are ZIP CSV files under the heading “SCC Data Files”. Below is code that will download the Onroad zipped file to a temporary file, read the region 5 .csv file into R, and delete the temporary file on your machine.
# create a temporary file
temp <- tempfile()
# download the .zip file to the temporary file
download.file("ftp://ftp.epa.gov/EmisInventory/2011/2011neiv2_onroad_byregions.zip", temp)
# read the "onroad_5.csv" file into R--this will take a while
data <- read.csv(unz(temp, "onroad_5.csv"))
# delete the temporary file
unlink(temp)
head(data)
## state_and_county_fips_code
## 1 88180
## 2 88180
## 3 88180
## 4 88180
## 5 88181
## 6 88181
## tribal_name st_usps_cd
## 1 Shoshone-Bannock Tribes of the Fort Hall Reservation of Idaho
## 2 Shoshone-Bannock Tribes of the Fort Hall Reservation of Idaho
## 3 Shoshone-Bannock Tribes of the Fort Hall Reservation of Idaho
## 4 Shoshone-Bannock Tribes of the Fort Hall Reservation of Idaho
## 5 Coeur d'Alene Tribe of the Coeur d'Alene Reservation, Idaho
## 6 Coeur d'Alene Tribe of the Coeur d'Alene Reservation, Idaho
## county_name data_category_cd scc
## 1 NP 2103008000
## 2 NP 2302002100
## 3 NP 2102008000
## 4 NP 2302002200
## 5 NP 2102008000
## 6 NP 2302002200
Not only can you make a connection to a database with R, but you can pass SQL statements to the database and retrieve the result inside of R. Here’s an example of querying an SQLite database in R.
library(RSQLite)
# Create an SQLite database and save connection in db_cities object
db_cities <- dbConnect(drv = SQLite(), dbname = "QueryTest.sqlite")
# read in the AQS codes table for city names
aqs_cities <- read.csv("https://aqs.epa.gov/aqsweb/codes/data/CityNamesByState.csv",
stringsAsFactors = FALSE, skip = 1)
# replace . with _ in table names--otherwise you will not be able to use field names
names(aqs_cities) <- sub("\\.", "_", names(aqs_cities))
# make a Cities table in the database
dbWriteTable(db_cities, "Cities", aqs_cities)
# send a query to the database for cities in Illinois
query <- dbSendQuery(db_cities, "SELECT * FROM Cities WHERE State_Name = 'Illinois'")
# get a data frame of cities in Illinois
illinois_cities <- dbFetch(query)
head(illinois_cities)
## State_Name State_Code City_Code City_Name County_Code County_Name
## 1 Illinois 17 113 Abingdon 95 Knox
## 2 Illinois 17 230 Addieville 189 Washington
## 3 Illinois 17 243 Addison 43 DuPage
## 4 Illinois 17 295 Adeline 141 Ogle
## 5 Illinois 17 516 Albany 195 Whiteside
## 6 Illinois 17 555 Albers 27 Clinton
It’s always good practice to clear the query result and disconnect from the database.
dbClearResult(query)
dbDisconnect(db_cities)
Then we delete the database.
unlink("QueryTest.sqlite")