Topics

In this session we’ll be covering:

Remember from last session…

Working with R Data Frames

  • We’ll begin by loading the chicago_air data frame from last session
library(region5air)
data(chicago_air)

We always want to make sure our data looks the way it is supposed to before we begin working with it.

Remember, the best way to take a quick look at the first few rows of a data frame is to use the head() function

data(chicago_air)
head(chicago_air)  
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 3 2013-01-03 0.021   28  0.17     1       5
## 4 2013-01-04 0.028   18  0.62     1       6
## 5 2013-01-05 0.025   26  0.48     1       7
## 6 2013-01-06 0.026   36  0.47     1       1

You can specify the number of lines to display by using the n = parameter

head(chicago_air, n = 3)
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 3 2013-01-03 0.021   28  0.17     1       5

You can also look at the bottom of the data frame by using tail()

tail(chicago_air)
##           date ozone temp solar month weekday
## 360 2013-12-26 0.026   NA  0.41    12       5
## 361 2013-12-27 0.021   NA  0.62    12       6
## 362 2013-12-28 0.026   NA  0.61    12       7
## 363 2013-12-29 0.029   NA  0.08    12       1
## 364 2013-12-30 0.024   NA  0.44    12       2
## 365 2013-12-31 0.021   NA  0.49    12       3

The table() function is helpful for summarizing your data by counts, and the plot() and hist() functions allow you to quickly visualize the data

table(chicago_air$ozone)  # Counts how many times each ozone value occurs
plot(chicago_air$ozone)  # Quick plot of each value against its row position
hist(chicago_air$ozone)  # Histogram: values are grouped into bins and counted
## 
## 0.004 0.008  0.01 0.011 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 
##     1     1     1     1     1     3     6     4     5     3     3     6 
## 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 
##    11    10    12    12    12    11     6    13    12     8     5     6 
## 0.033 0.034 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 
##    12     8    13     8     8     8    11     6     9     4     4     7 
## 0.045 0.046 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 
##     6     4     5     6     5     7     6     5     4     5     6     3 
## 0.057 0.058 0.059  0.06 0.061 0.062 0.064 0.065 0.066 0.067 0.068 0.069 
##     3     3     3     2     1     2     2     1     2     1     1     2 
## 0.074 0.078 0.081 
##     1     1     1
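Because ozone takes many distinct values, the counts above are spread thin. As a minimal sketch of what table() is doing, here it is on a small made-up vector (the weekday values are invented for illustration and are not taken from chicago_air):

```r
# Made-up weekday codes, for illustration only
weekday <- c(3, 4, 5, 3, 3, 5)

# table() counts how many times each distinct value occurs
counts <- table(weekday)

names(counts)      # the distinct values, stored as character labels
## [1] "3" "4" "5"
as.vector(counts)  # how often each one occurs
## [1] 3 1 2
```

table() tends to be most readable on columns with a handful of distinct values, such as month or weekday.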


Indexing

Now we may want to view just a small subset of the data, or do something to a particular row or column of the dataset. R can subset vectors or data frames based on location, also called the index value. An index value works like coordinates on a map. It is important to remember that in R the index order is [rows, columns]. Below is an example of how you access a particular value in a data frame based on its index.

chicago_air[5,3] ## This should grab the value associated with the fifth row and the third column
## [1] 26

Let’s look at our View function to see if this matches up with our dataframe.

View(chicago_air)

We can also access data from a vector using the same indexing idea. In this case, you don’t need the comma to separate the rows and columns since you are accessing one dimensional data.

x <- c(1, 3, 2, 7, 25.3, 6)
x[5]  # This will access the fifth element in the vector
## [1] 25.3
  • Now that we understand indexing we can subset the chicago_air data frame by using the [ function, i.e. brackets

Subsetting using indexing

To get one row of the data frame, specify the row number you would like in the brackets, on the left side of the comma. By leaving the column value blank, it returns all the columns associated with row number 1.

chicago_air[1, ]
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3

Remember, the convention is [rows, columns]

If you want more than one row, you can supply a vector of row numbers

chicago_air[c(1, 2, 5), ]  # Accesses the 1st, 2nd and 5th rows of data
##         date ozone temp solar month weekday
## 1 2013-01-01 0.032   17  0.65     1       3
## 2 2013-01-02 0.020   15  0.61     1       4
## 5 2013-01-05 0.025   26  0.48     1       7

To get a column from the data frame, specify the column number in the brackets, to the right of the comma. By leaving the row value blank, you are telling it to return all rows associated with column 1.

chicago_air[, 1]

You can obtain more than one column by supplying a vector of column numbers

chicago_air[, c(3, 4, 6)]

Column names can also be used which is really handy if you are more familiar with the name of a column than its location.

chicago_air[, "solar"]

Or a vector of column names

chicago_air[, c("ozone", "temp", "month")]

Both rows and columns can be specified. Here we use the colon operator (:), which generates the sequence from:to (e.g. 1:5 gives 1, 2, 3, 4, 5).

chicago_air[1:5, 3:5]  # Returns the values associated with the first 5 rows of data and the third through fifth columns.
##   temp solar month
## 1   17  0.65     1
## 2   15  0.61     1
## 3   28  0.17     1
## 4   18  0.62     1
## 5   26  0.48     1

Logical Operators

  • You can also subset a data frame by using logical expressions
  • The logical expression is used to specify rows that you want to keep or discard
chicago_air[(logical expression), ]

Reference Table of Logical Operators

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
x & y      x AND y
x | y      x OR y
  • Another helpful tool when subsetting is the complete.cases function.
  • This function allows us to only look at data where observations for all columns are complete.
chi_air.complete <- chicago_air[complete.cases(chicago_air),] 

##Here I have indicated that we should subset or filter the chicago_air dataframe by only those rows where all the data is present.

aq <- chi_air.complete # For convenience, and to save space, I rename the data frame
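To see what complete.cases() returns before it is used inside the brackets, here is a small sketch on a made-up data frame (toy and its values are assumptions for illustration, not part of the chicago_air data):

```r
# Made-up data frame with some missing values
toy <- data.frame(
  ozone = c(0.032, 0.020, NA, 0.028),
  temp  = c(17, NA, 28, 18)
)

# One TRUE/FALSE per row: TRUE only when no column in that row is NA
complete.cases(toy)
## [1]  TRUE FALSE FALSE  TRUE

# Keep only the complete rows
toy.complete <- toy[complete.cases(toy), ]
nrow(toy.complete)
## [1] 2
```

Note that a row is dropped if any column is missing, so it can be worth checking how many rows you lose.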


Now back to subsetting with logical operators…

  • Now we are only working with our aq dataframe which has a complete observation for each date and time.
  • Let’s say we only want rows in this data frame where ozone was above 70 ppb (.070 ppm)
aq[(aq$ozone > .070), ]  # This returns all the days with readings above .070 ppm
##           date ozone temp solar month weekday
## 134 2013-05-14 0.081   74  1.40     5       3
## 171 2013-06-20 0.074   80  1.35     6       5
## 252 2013-09-09 0.078   83  1.11     9       2
oz.viol <- aq[(aq$ozone > .070), ]  # You must assign this to a variable if you want to save that information.
  • So, the way the logical vector subsets the data frame is by providing a vector that indicates if a row should be kept (TRUE) or dropped (FALSE)
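The logical vector can also be looked at on its own. The sketch below uses a small made-up data frame (toy is a stand-in for aq, and its values are invented) to show the TRUE/FALSE vector and how it picks rows:

```r
# Made-up stand-in for aq
toy <- data.frame(
  day   = 1:5,
  ozone = c(0.032, 0.081, 0.045, 0.074, 0.020)
)

# The comparison returns one TRUE/FALSE value per row
keep <- toy$ozone > 0.070
keep
## [1] FALSE  TRUE FALSE  TRUE FALSE

# Rows where keep is TRUE are returned; the rest are dropped
high <- toy[keep, ]
high$day
## [1] 2 4

# sum() counts the TRUE values, i.e. the number of matching rows
sum(keep)
## [1] 2
```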

If we wanted all of the days in the 7th month, we could use ==

aq[(aq$month == 7), ]
##           date ozone temp solar month weekday
## 207 2013-07-26 0.029   69  0.29     7       6
## 208 2013-07-27 0.021   61  0.89     7       7
## 209 2013-07-28 0.023   60  1.15     7       1
## 210 2013-07-29 0.036   66  1.19     7       2
## 211 2013-07-30 0.025   73  0.92     7       3
## 212 2013-07-31 0.043   67  0.50     7       4

Or if we want all days except the 6th day of the week, use !=

aq[(aq$weekday != 6), ]  #Excludes all data associated with the 6th day of the week
  • We can combine logical conditions with & (and operator)
  • If we wanted only rows where the temperature was between 80 and 85 (including those numbers)
aq[(aq$temp >= 80 & aq$temp <= 85), ]
##           date ozone temp solar month weekday
## 121 2013-05-01 0.068   80  1.36     5       4
## 140 2013-05-20 0.069   81  1.38     5       2
## 150 2013-05-30 0.038   82  1.09     5       5
## 168 2013-06-17 0.062   84  1.44     6       2
## 171 2013-06-20 0.074   80  1.35     6       5
## 172 2013-06-21 0.048   81  0.60     6       6
## 174 2013-06-23 0.058   82  1.35     6       1
## 178 2013-06-27 0.050   83  1.35     6       5
## 219 2013-08-07 0.056   82  1.05     8       4
## 232 2013-08-20 0.061   82  1.16     8       3
## 233 2013-08-21 0.058   85  1.16     8       4
## 237 2013-08-25 0.060   84  1.20     8       1
## 238 2013-08-26 0.053   84  1.01     8       2
## 250 2013-09-07 0.050   83  1.11     9       7
## 252 2013-09-09 0.078   83  1.11     9       2
## 254 2013-09-11 0.045   81  0.95     9       4
  • We can also use the or operator, |
  • If we only wanted rows on days 3 or 5
aq[(aq$weekday == 3 | aq$weekday == 5),]
##           date ozone temp solar month weekday
## 1   2013-01-01 0.032   17  0.65     1       3
## 3   2013-01-03 0.021   28  0.17     1       5
## 8   2013-01-08 0.021   30  0.39     1       3
## 10  2013-01-10 0.024   33  0.42     1       5
## 15  2013-01-15 0.017   19  0.66     1       3
## 17  2013-01-17 0.034   33  0.69     1       5
## 22  2013-01-22 0.026    1  0.73     1       3
## 36  2013-02-05 0.026   26  0.21     2       3
## 38  2013-02-07 0.025   33  0.09     2       5
## 43  2013-02-12 0.028   28  0.30     2       3
## 45  2013-02-14 0.033   38  0.28     2       5
## 50  2013-02-19 0.029   26  0.54     2       3
## 52  2013-02-21 0.036   28  0.17     2       5
## 57  2013-02-26 0.035   35  0.29     2       3
## 59  2013-02-28 0.033   34  0.58     2       5
## 64  2013-03-05 0.037   33  0.22     3       3
## 66  2013-03-07 0.039   34  1.03     3       5
## 71  2013-03-12 0.039   34  1.05     3       3
## 73  2013-03-14 0.031   35  0.76     3       5
## 78  2013-03-19 0.042   24  1.19     3       3
## 80  2013-03-21 0.034   26  1.20     3       5
## 85  2013-03-26 0.039   38  1.13     3       3
## 87  2013-03-28 0.044   44  1.23     3       5
## 92  2013-04-02 0.045   37  1.28     4       3
## 94  2013-04-04 0.055   49  1.27     4       5
## 99  2013-04-09 0.035   40  0.83     4       3
## 101 2013-04-11 0.032   40  0.14     4       5
## 106 2013-04-16 0.041   47  1.34     4       3
## 108 2013-04-18 0.035   51  0.16     4       5
## 113 2013-04-23 0.036   58  0.30     4       3
## 115 2013-04-25 0.046   45  0.83     4       5
## 120 2013-04-30 0.059   74  1.37     4       3
## 122 2013-05-02 0.029   45  1.22     5       5
## 127 2013-05-07 0.059   61  1.38     5       3
## 129 2013-05-09 0.045   64  0.61     5       5
## 134 2013-05-14 0.081   74  1.40     5       3
## 136 2013-05-16 0.057   73  1.41     5       5
## 141 2013-05-21 0.055   74  1.27     5       3
## 143 2013-05-23 0.033   51  1.09     5       5
## 148 2013-05-28 0.033   72  0.93     5       3
## 150 2013-05-30 0.038   82  1.09     5       5
## 157 2013-06-06 0.040   55  0.87     6       5
## 162 2013-06-11 0.066   78  1.23     6       3
## 164 2013-06-13 0.040   64  1.46     6       5
## 169 2013-06-18 0.033   57  1.43     6       3
## 171 2013-06-20 0.074   80  1.35     6       5
## 176 2013-06-25 0.041   72  0.98     6       3
## 178 2013-06-27 0.050   83  1.35     6       5
## 211 2013-07-30 0.025   73  0.92     7       3
## 213 2013-08-01 0.041   75  1.32     8       5
## 218 2013-08-06 0.044   73  1.00     8       3
## 220 2013-08-08 0.035   68  1.29     8       5
## 225 2013-08-13 0.033   65  1.34     8       3
## 227 2013-08-15 0.042   72  1.21     8       5
## 232 2013-08-20 0.061   82  1.16     8       3
## 234 2013-08-22 0.040   74  0.41     8       5
## 239 2013-08-27 0.057   89  1.18     8       3
## 241 2013-08-29 0.047   77  1.23     8       5
## 246 2013-09-03 0.035   67  1.23     9       3
## 248 2013-09-05 0.036   69  1.23     9       5
## 253 2013-09-10 0.059   91  1.15     9       3
## 255 2013-09-12 0.034   72  1.18     9       5
## 260 2013-09-17 0.039   64  0.97     9       3
## 262 2013-09-19 0.035   70  0.87     9       5
## 267 2013-09-24 0.043   64  1.14     9       3
## 269 2013-09-26 0.050   66  1.14     9       5
## 274 2013-10-01 0.044   68  0.92    10       3
## 276 2013-10-03 0.022   66  0.46    10       5

Subsetting using the subset() function

  • You can also use the subset() function.
  • The first argument in the function is the data frame and the second argument is the logical expression e.g. subset(x, logical expression)
high.temp <- subset(aq, temp > 90)  # With subset() you can refer to a column by name without using the $ operator. To save the result, assign it to a variable.

By using the select = parameter you can specify which columns to keep

aq.sub <- subset(aq, temp > 90, select = c(ozone, temp))

You can save the subsetted data to a new variable so that you only work with that in the future
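As a rough sketch of how subset() relates to the bracket notation used earlier, the example below applies the same filter both ways. The toy data frame and its values are made up for illustration:

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  ozone = c(0.059, 0.040, 0.078, 0.025),
  temp  = c(91, 64, 83, 73),
  month = c(9, 6, 9, 7)
)

# subset(): conditions can be combined with &, and select = picks the columns
hot.sept <- subset(toy, temp > 80 & month == 9, select = c(ozone, temp))
hot.sept
##   ozone temp
## 1 0.059   91
## 3 0.078   83

# The same request written with brackets and the $ operator
same <- toy[toy$temp > 80 & toy$month == 9, c("ozone", "temp")]
all.equal(hot.sept, same)
```

Which form to use is mostly a matter of taste; subset() is shorter to type, while the bracket form makes every step explicit.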


Sorting data in base R

You can sort data using the order function. The default is ascending order; placing a minus sign in front of a numeric variable sorts that variable in descending order instead. Let’s sort the chicago_air dataset by ozone.

sort.oz <- chicago_air[order(chicago_air$ozone),] # sort the dataset by ozone in ascending order

sort.oz.sol <- with(chicago_air, chicago_air[order(ozone, solar),])  # sort the data first by ozone and then by solar radiation

sort.oz.sol2 <- with(chicago_air, chicago_air[order(ozone, -solar),])   # sort the data first by ozone in ascending order, then by solar radiation in descending order
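It can help to see that order() does not return sorted data; it returns the row positions in sorted order, which are then used as a row index. A minimal sketch on a made-up data frame (values invented so that ozone has ties):

```r
# Made-up data with tied ozone values
toy <- data.frame(
  ozone = c(0.030, 0.020, 0.030, 0.020),
  solar = c(0.50, 1.20, 0.90, 0.40)
)

# Row positions sorted by ozone, with ties broken by ascending solar
order(toy$ozone, toy$solar)
## [1] 4 2 1 3

# Same, but ties broken by descending solar
order(toy$ozone, -toy$solar)
## [1] 2 4 3 1

# Using the positions as a row index gives the sorted data frame
sorted <- toy[order(toy$ozone, -toy$solar), ]
sorted$solar
## [1] 1.2 0.4 0.9 0.5
```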

Now that you have created several data frames by chopping up the data, it’s a good time to learn about recombining data using cbind() and rbind()

  • The rbind function requires that the data frames have the same number of columns, with matching column names.
  • The rbind.fill function from the plyr package gets around this by filling the missing columns with NAs
  • cbind will allow you to combine vectors, matrices or data frames by columns
combo <- rbind(sort.oz.sol, sort.oz.sol2)

install.packages("plyr")  # Only needs to be run once
library(plyr)
rbind.fill(oz.viol, aq.sub)


cbind(aq.sub, oz.viol)  # The number of rows should be equal. If it is not, the shorter dataset is recycled, which only works when its row count divides evenly into the longer one (here 1 row is repeated 3 times).
##         date ozone temp solar month weekday
## 1 2013-05-14 0.081   74  1.40     5       3
## 2 2013-06-20 0.074   80  1.35     6       5
## 3 2013-09-09 0.078   83  1.11     9       2
## 4       <NA> 0.059   91    NA    NA      NA
##   ozone temp       date ozone temp solar month weekday
## 1 0.059   91 2013-05-14 0.081   74  1.40     5       3
## 2 0.059   91 2013-06-20 0.074   80  1.35     6       5
## 3 0.059   91 2013-09-09 0.078   83  1.11     9       2
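Here is a small base R sketch of the same ideas that does not need plyr. The data frames a, b and flag are made up for illustration:

```r
# Two made-up data frames with the same column names
a <- data.frame(ozone = c(0.081, 0.074), temp = c(74, 80))
b <- data.frame(ozone = 0.059, temp = 91)

# rbind() stacks rows: 2 rows + 1 row = 3 rows, still 2 columns
stacked <- rbind(a, b)
dim(stacked)
## [1] 3 2

# cbind() adds columns: both pieces have 3 rows
flag <- data.frame(high.temp = c(FALSE, FALSE, TRUE))
side <- cbind(stacked, flag)
names(side)
## [1] "ozone"     "temp"      "high.temp"

# rbind() stops with an error when the columns do not match
mismatch <- try(rbind(a, data.frame(ozone = 0.050)), silent = TRUE)
inherits(mismatch, "try-error")
## [1] TRUE
```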

Now that we know how to manipulate datasets in R, I would like to take a not so brief detour and talk about working with dates in R

Importing data with dates

  • Remember back to our discussion of importing data from Excel .csv files or AQS .txt files?
  • Most of the data we care about as environmental data analysts has dates and times associated with it. These have to be properly imported into R so we can do analysis and create plots.

  • Dates can be imported from character, numeric, POSIXlt, and POSIXct formats using the as.Date function from the base package.
  • If you want to create dates in R please refer to the help(POSIXlt) and help(POSIXct) help files.

  • However, today we will talk about converting existing dates to the correct date/time format since that is often the case with environmental data.

  • If your data were exported from Excel (or other file formats), they may be either in numeric or character format.


Importing Dates from Character Format

  • If your dates are stored as characters, you simply need to provide the as.Date function with your vector of dates and the format they are currently stored in.
  • The possible date formats are listed in a table below.

  • For example, “05/27/84” is in the format %m/%d/%y, while “May 27 1984” is in the format %B %d %Y.
  • To import those dates, you would provide your dates, their format (if you don’t specify a format as.Date will try %Y-%m-%d and then %Y/%m/%d), and the timezone they are in.

Reference Table of Date Formats in R

Symbol  Meaning                            Example
%d      day of the month as a number       01-31
%a      abbreviated weekday                Mon
%A      unabbreviated weekday              Monday
%m      month as a number                  01-12
%b      abbreviated month                  Jan
%B      unabbreviated month                January
%y      2-digit year                       01
%Y      4-digit year                       2001
%j      decimal (julian) day of the year   001-366
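To make the format symbols concrete, here is a short sketch that converts a few made-up date strings. The month-name example assumes an English locale, since %B and %b depend on the language settings of your R session:

```r
# Two-digit year, month first
d1 <- as.Date("05/27/84", format = "%m/%d/%y")
d1
## [1] "1984-05-27"

# Already in the default %Y-%m-%d layout, so no format is needed
d2 <- as.Date("1984-05-27")

# Day first, separated by periods
d3 <- as.Date("27.05.1984", format = "%d.%m.%Y")

# Full month name (assumes an English locale; may give NA otherwise)
d4 <- as.Date("May 27 1984", format = "%B %d %Y")

# A format that does not match the text gives NA rather than an error
as.Date("05/27/84", format = "%Y-%m-%d")
## [1] NA
```

Because a wrong format fails quietly, it is worth checking for NA values after converting a whole column.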

When we imported our chicago_air sample dataset, our dates came in as characters.

str(chicago_air$date)
##  chr [1:365] "2013-01-01" "2013-01-02" "2013-01-03" ...

We need to change them to R dates using the as.Date function below

chicago_air$date <- as.Date(chicago_air$date, format = '%Y-%m-%d', tz = "America/Chicago")

str(chicago_air$date)
##  Date[1:365], format: "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...

Now let’s look at some more complicated dates and times as they might look in an hourly AQS file.

data(airdata)
head(airdata)
  • Let’s examine the structure of our dates to see if they came in as a character data type.
  • If your dates look something like this 12312007 and R treats them as numeric, it is best to change these to character format using x <- as.character(x) before converting them to dates.
str(airdata$datetime)
##  chr [1:367595] "20141231T0100-0600" "20141231T0100-0600" ...
site1 <- airdata[airdata$site == 840181270011, ]  ##Let's subset this dataframe down to 1 site for ease of use

head(site1)
##                site data_status action_code           datetime parameter
## 326199 840181270011           0          10 20141231T0100-0600     62101
## 326205 840181270011           0          10 20141230T0100-0600     62101
## 326211 840181270011           0          10 20141229T0100-0600     62101
## 326217 840181270011           0          10 20141228T0100-0600     62101
## 326223 840181270011           0          10 20141227T0100-0600     62101
## 326229 840181270011           0          10 20141226T0100-0600     62101
##        duration frequency value unit qc poc      lat       lon GISDatum
## 326199       60         0    13   15  0   1 41.63411 -87.10148    WGS84
## 326205       60         0    27   15  0   1 41.63411 -87.10148    WGS84
## 326211       60         0    32   15  0   1 41.63411 -87.10148    WGS84
## 326217       60         0    32   15  0   1 41.63411 -87.10148    WGS84
## 326223       60         0    46   15  0   1 41.63411 -87.10148    WGS84
## 326229       60         0    37   15  0   1 41.63411 -87.10148    WGS84
##        elev method_code mpc mpc_value uncertainty qualifiers
## 326199  196          41   1       -60          NA       <NA>
## 326205  196          41   1       -60          NA       <NA>
## 326211  196          41   1       -60          NA       <NA>
## 326217  196          41   1       -60          NA       <NA>
## 326223  196          41   1       -60          NA       <NA>
## 326229  196          41   1       -60          NA       <NA>
  • Suppose we want to retain both the date and time information in this dataset.
  • We need to use a function called strptime to deal with times or date-time combos.
  • To use the strptime function we need to get our date and time information into a more reasonable looking format than what they came in (e.g. 20141231T0100-0600)
    • We will need the paste and substr functions to pull out the elements that we need before we can use them with strptime
datetime <- '20141231T0100-0600'  ##This is the ugly version we have now

date <- substr(datetime, start = 1, stop = 8)  ## Let's grab the first 8 characters

time <- substr(datetime, start = 10, stop = 13)  ## And characters 10 through 13, which hold the time (the trailing -0600 UTC offset is left out)

new.datetime <- paste(date, time, sep=" ")  #Now we will paste them together into a new date-time 

new.datetime
## [1] "20141231 0100"

Now, we can feed this new data into the strptime() function. This function requires the following arguments (similar to as.Date): date-time values, date-time format, and the timezone. The strptime function requires the date to be in character format first.

final.datetime  <- strptime(new.datetime, format="%Y%m%d %H%M", tz="America/Chicago")

final.datetime  # The AQS date-time values are now recognized as a date and time by R and can be used in future time-series analysis and plotting.
## [1] "2014-12-31 01:00:00 CST"
  • as.POSIXlt and as.POSIXct are two other functions that will convert many data types to date-times, and as.POSIXct is the preferred method for data that will be fed into the openair package. The ‘ct’ in as.POSIXct stands for calendar time; it stores the number of seconds since the origin (1970-01-01 UTC). The ‘lt’ in as.POSIXlt stands for local time; it stores date-time information as a list of time attributes such as “hour” and “mon”. This might be useful if you want to extract those attributes later, but unless you are very familiar with extracting information from lists, you may want to avoid this format. Luckily, if you are using as.POSIXct date-times in openair, it will automatically pull out these attributes for you.

  • as.POSIXct requires the same information as strptime as you can see in the example below.

final.datetime <- as.POSIXct(new.datetime, format="%Y%m%d %H%M", tz="America/Chicago")
final.datetime
## [1] "2014-12-31 01:00:00 CST"
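The difference between the two classes is easier to see side by side. This is a minimal sketch using the same example timestamp; the variable names ct and lt are just for illustration:

```r
# POSIXct: a single number (seconds since 1970-01-01 UTC) with a time zone label
ct <- as.POSIXct("20141231 0100", format = "%Y%m%d %H%M", tz = "America/Chicago")
class(ct)
## [1] "POSIXct" "POSIXt"
as.numeric(ct)  # the underlying count of seconds

# POSIXlt: a list of components that can be pulled out with $
lt <- as.POSIXlt(ct)
lt$hour
## [1] 1
lt$mon          # months are counted from 0, so December is 11
## [1] 11
lt$year + 1900  # years are stored as years since 1900
## [1] 2014
```

The zero-based month and the 1900 offset are common stumbling blocks, which is one reason POSIXct is usually the easier choice for columns in a data frame.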

Reference Table of Time Formats in R

Symbol  Meaning                      Example
%H      hour as a number (24 hour)   00-23
%M      minute as a number           00-59
%S      second as a number           00-61 (allows for leap seconds)
%I      hour as a number (12 hour)   01-12
  • Here are the paste and substr functions nested within one statement and combined with the strptime function to use on our site1 dataset created above:
site1$newdatetime <- strptime(paste(substr(site1$datetime, 1, 8), substr(site1$datetime, 10, 13), sep=" "), format="%Y%m%d %H%M", tz="America/Chicago")
head(site1)
##                site data_status action_code           datetime parameter
## 326199 840181270011           0          10 20141231T0100-0600     62101
## 326205 840181270011           0          10 20141230T0100-0600     62101
## 326211 840181270011           0          10 20141229T0100-0600     62101
## 326217 840181270011           0          10 20141228T0100-0600     62101
## 326223 840181270011           0          10 20141227T0100-0600     62101
## 326229 840181270011           0          10 20141226T0100-0600     62101
##        duration frequency value unit qc poc      lat       lon GISDatum
## 326199       60         0    13   15  0   1 41.63411 -87.10148    WGS84
## 326205       60         0    27   15  0   1 41.63411 -87.10148    WGS84
## 326211       60         0    32   15  0   1 41.63411 -87.10148    WGS84
## 326217       60         0    32   15  0   1 41.63411 -87.10148    WGS84
## 326223       60         0    46   15  0   1 41.63411 -87.10148    WGS84
## 326229       60         0    37   15  0   1 41.63411 -87.10148    WGS84
##        elev method_code mpc mpc_value uncertainty qualifiers
## 326199  196          41   1       -60          NA       <NA>
## 326205  196          41   1       -60          NA       <NA>
## 326211  196          41   1       -60          NA       <NA>
## 326217  196          41   1       -60          NA       <NA>
## 326223  196          41   1       -60          NA       <NA>
## 326229  196          41   1       -60          NA       <NA>
##                newdatetime
## 326199 2014-12-31 01:00:00
## 326205 2014-12-30 01:00:00
## 326211 2014-12-29 01:00:00
## 326217 2014-12-28 01:00:00
## 326223 2014-12-27 01:00:00
## 326229 2014-12-26 01:00:00
#By putting site1$newdatetime to the left of the arrow I have done a shortcut for creating a new column of data in my dataset with the name "newdatetime"
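Since substr and paste work on whole vectors, the same steps can be tried on a few strings without loading the airdata file. The sketch below uses made-up timestamps in the AQS layout and as.POSIXct in place of strptime; like the code above, it assumes the times are local Central time and ignores the trailing -0600 offset:

```r
# Made-up timestamps in the same layout as the AQS datetime column
raw <- c("20141231T0100-0600", "20141230T1300-0600", "20141229T2300-0600")

# Characters 1-8 are the date, characters 10-13 are the time
stamp <- paste(substr(raw, 1, 8), substr(raw, 10, 13), sep = " ")
stamp
## [1] "20141231 0100" "20141230 1300" "20141229 2300"

# Convert the whole vector in one call
parsed <- as.POSIXct(stamp, format = "%Y%m%d %H%M", tz = "America/Chicago")
format(parsed, "%Y-%m-%d %H:%M")
## [1] "2014-12-31 01:00" "2014-12-30 13:00" "2014-12-29 23:00"
```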

Importing Dates from Numeric Format

If you are importing data from Excel, you may have dates that are in an Excel numeric format. We can still use as.Date to import these; we simply need to know the origin date for Excel’s date numbering system and provide that to the as.Date function.

For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. For Excel on Mac, the origin date is January 1, 1904.

new.date.format <- as.Date(42274.00, origin="1899-12-30",tz="GMT")  ##This code should work for any Excel dates generated from a Windows machine.
new.date.format
## [1] "2015-09-27"
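as.Date also accepts a whole vector of serial numbers. The sketch below uses made-up serial values, and the second call assumes a workbook that uses the 1904 date system described above, where the same day has a serial number that is 1462 smaller:

```r
# Made-up Excel serial numbers (1900 date system, as on Windows)
excel.serial <- c(42274, 42275, 42005)
win.dates <- as.Date(excel.serial, origin = "1899-12-30")
win.dates
## [1] "2015-09-27" "2015-09-28" "2015-01-01"

# The same first day as it would be numbered in the 1904 date system
mac.date <- as.Date(42274 - 1462, origin = "1904-01-01")
mac.date
## [1] "2015-09-27"
```

If converted dates come out about four years off, the wrong origin is a likely cause.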

Changing Date Formats

If you would like to use dates in a format other than the standard %Y-%m-%d, you can do that using the format function from the base package. Be sure you save it with a variable name or the new format won’t be saved. This is simply changing the way the dates appear.

date.format2 <- format(new.date.format,"%d %b %Y")
date.format2
## [1] "27 Sep 2015"
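One thing to keep in mind is that format() returns text, not a Date. A short sketch using the same example date:

```r
d <- as.Date("2015-09-27")

# Different display formats for the same date
us.style <- format(d, "%m/%d/%Y")
us.style
## [1] "09/27/2015"

format(d, "%j")  # day of the year
## [1] "270"

# The result is a character string, so date arithmetic no longer works on it
class(us.style)
## [1] "character"
class(d)
## [1] "Date"
```

For that reason it is usually safer to keep the original Date column for calculations and sorting, and to create formatted text only for labels and reports.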


Important: There is a package called lubridate, by Garrett Grolemund and Hadley Wickham, which makes the conversion of date-time data much more intuitive. Once you have a handle on the date-time functionality in base R, I would recommend looking into this package. Below is a short example from the package vignette which shows how to assign R date-times using lubridate.

library(lubridate)
arrive <- ymd_hms("2011-06-04 12:00:00", tz = "Pacific/Auckland")
arrive
## [1] "2011-06-04 12:00:00 NZST"

Now let’s try some exercises to test our understanding of subsetting and sorting data and working with dates in R.

Exercise 3

http://rpubs.com/kfrost14/Ex3