In this session we’ll be covering:
Remember from last session…
Load the chicago_air data frame from last session:
library(region5air)
data(chicago_air)
We always want to make sure our data looks the way it is supposed to before we begin working with it.
Remember, the best way to take a quick look at the first few rows of a data frame is to use the head() function.
data(chicago_air)
head(chicago_air)
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
## 4 2013-01-04 0.028 18 0.62 1 6
## 5 2013-01-05 0.025 26 0.48 1 7
## 6 2013-01-06 0.026 36 0.47 1 1
You can specify the number of lines to display by using the n = parameter.
head(chicago_air, n = 3)
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 3 2013-01-03 0.021 28 0.17 1 5
You can also look at the bottom of the data frame by using tail()
tail(chicago_air)
## date ozone temp solar month weekday
## 360 2013-12-26 0.026 NA 0.41 12 5
## 361 2013-12-27 0.021 NA 0.62 12 6
## 362 2013-12-28 0.026 NA 0.61 12 7
## 363 2013-12-29 0.029 NA 0.08 12 1
## 364 2013-12-30 0.024 NA 0.44 12 2
## 365 2013-12-31 0.021 NA 0.49 12 3
The table() function is helpful for summarizing your data by counts, and the plot() and hist() functions allow you to quickly visualize the data.
table(chicago_air$ozone) # Summarizes by counts of each distinct value
plot(chicago_air$ozone) # Quick plot of each value against its position; no binning occurs
hist(chicago_air$ozone) # Histogram: groups the values into bins and plots the count in each bin
##
## 0.004 0.008 0.01 0.011 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02
## 1 1 1 1 1 3 6 4 5 3 3 6
## 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032
## 11 10 12 12 12 11 6 13 12 8 5 6
## 0.033 0.034 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044
## 12 8 13 8 8 8 11 6 9 4 4 7
## 0.045 0.046 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056
## 6 4 5 6 5 7 6 5 4 5 6 3
## 0.057 0.058 0.059 0.06 0.061 0.062 0.064 0.065 0.066 0.067 0.068 0.069
## 3 3 3 2 1 2 2 1 2 1 1 2
## 0.074 0.078 0.081
## 1 1 1
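To make the difference between table() and hist() concrete, here is a minimal sketch on a made-up vector. The values in `oz` and the chosen `breaks` are hypothetical and are not taken from chicago_air. Passing plot = FALSE to hist() returns the bins and their counts without drawing anything.

```r
# Hypothetical ozone readings in ppm (made up for illustration)
oz <- c(0.021, 0.021, 0.025, 0.032, 0.032, 0.032, 0.047)

# table() counts how often each distinct value occurs
counts <- table(oz)
counts

# hist() groups values into bins; plot = FALSE returns the bins
# and their counts instead of drawing the histogram
h <- hist(oz, breaks = c(0.02, 0.03, 0.04, 0.05), plot = FALSE)
h$counts   # number of readings falling in each bin
```

table() keeps every distinct value separate, so it is most readable when a column has few distinct values; hist() is usually the better summary for continuous measurements.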
Now we may want to view just a small subset of the data. Say you just want to do something to a particular row or column of the dataset. R can subset vectors or data frames based on their location or index value. An index value is just like reading coordinates on a map. However, it is important to remember that in R the index order is [rows, columns].
Below is an example of how you access a particular value in a data frame based on its index.
chicago_air[5,3] ## This should grab the value associated with the fifth row and the third column
## [1] 26
Let’s open the data frame with the View function to see if this matches up.
View(chicago_air)
We can also access data from a vector using the same indexing idea. In this case, you don’t need the comma to separate the rows and columns since you are accessing one dimensional data.
x <- c(1, 3, 2, 7, 25.3, 6)
x[5] # This will access the fifth element in the vector
## [1] 25.3
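The row-then-column order is easy to mix up, so here is a small sketch showing that swapping the two numbers picks a different cell. The data frame `df` is a hypothetical stand-in built by hand, not the real chicago_air data.

```r
# Hypothetical three-row data frame (values made up for illustration)
df <- data.frame(
  ozone = c(0.032, 0.020, 0.021),
  temp  = c(17, 15, 28),
  solar = c(0.65, 0.61, 0.17)
)

df[2, 3]   # row 2, column 3: the second solar value
df[3, 2]   # row 3, column 2: the third temp value

x <- c(1, 3, 2, 7, 25.3, 6)
x[c(1, 5)] # a vector of positions returns several elements at once
length(x)  # the largest valid position in x
```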
We can subset the chicago_air data frame by using the [ function, i.e. brackets.
To get one row of the data frame, specify the row number you would like in the brackets, on the left side of the comma. By leaving the column value blank, it returns all the columns associated with row number 1.
chicago_air[1, ]
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
Remember, the convention is [rows, columns]
If you want more than one row, you can supply a vector of row numbers
chicago_air[c(1, 2, 5), ] #Accesses the 1, 2 and 5th rows of data
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 2 2013-01-02 0.020 15 0.61 1 4
## 5 2013-01-05 0.025 26 0.48 1 7
To get a column from the data frame, specify the column number in the brackets, to the right of the comma. By leaving the row value blank, you are telling it to return all rows associated with column 1.
chicago_air[, 1]
You can obtain more than one column by supplying a vector of column numbers
chicago_air[, c(3, 4, 6)]
Column names can also be used which is really handy if you are more familiar with the name of a column than its location.
chicago_air[, "solar"]
Or a vector of column names
chicago_air[, c("ozone", "temp", "month")]
Both rows and columns can be specified. Here we are employing the colon operator (:), which means from:to (e.g. a:b).
chicago_air[1:5, 3:5] # Returns the values associated with the first 5 rows of data and the third through fifth columns.
## temp solar month
## 1 17 0.65 1
## 2 15 0.61 1
## 3 28 0.17 1
## 4 18 0.62 1
## 5 26 0.48 1
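One detail worth knowing when selecting columns with brackets: asking for a single column returns a plain vector, not a data frame. The sketch below uses a hypothetical hand-built `df` and the base R `drop = FALSE` argument, which the lesson itself does not use, to keep the data frame structure.

```r
# Hypothetical data frame (values made up for illustration)
df <- data.frame(
  ozone = c(0.032, 0.020, 0.021, 0.028),
  temp  = c(17, 15, 28, 18),
  solar = c(0.65, 0.61, 0.17, 0.62)
)

one_col <- df[, "temp"]                # a single column comes back as a vector
class(one_col)

kept_df <- df[, "temp", drop = FALSE]  # drop = FALSE keeps it a data frame
class(kept_df)

block <- df[2:3, c("ozone", "solar")]  # rows 2 to 3, two named columns
dim(block)                             # number of rows, number of columns
```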
Rows can also be selected with a logical expression:
chicago_air[(logical expression), ]
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
! x | not x |
x & y | x AND y |
x \| y | x OR y |
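A logical expression returns one TRUE or FALSE per element, and the brackets keep the elements where it is TRUE. The sketch below shows this on a hypothetical `temp` vector with made-up values.

```r
temp <- c(17, 15, 28, 18, 26)   # hypothetical temperatures

temp > 20                 # one TRUE/FALSE per element
temp >= 18 & temp <= 26   # both conditions must hold
temp < 16 | temp > 27     # either condition may hold
!(temp > 20)              # flips TRUE and FALSE

warm <- temp[temp > 20]   # keep only the elements where the test is TRUE
warm
```

The same idea carries over to data frames: the logical vector goes to the left of the comma and decides which rows are kept.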
Rows with missing values can be dropped with the complete.cases function.
chi_air.complete <- chicago_air[complete.cases(chicago_air),]
# Here we subset (filter) the chicago_air data frame to only those rows where all the data are present.
aq <- chi_air.complete # For convenience, and to save space, I rename the data frame
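Removing incomplete rows first matters because a comparison against NA gives NA, and an NA inside the brackets produces a row filled with NAs. The following sketch illustrates this on a hypothetical two-column data frame with one missing temperature.

```r
# Hypothetical data with one missing temperature
df <- data.frame(
  ozone = c(0.032, 0.020, 0.081),
  temp  = c(17, NA, 74)
)

complete.cases(df)          # FALSE marks a row with at least one NA

messy <- df[df$temp > 16, ] # the NA comparison adds an all-NA row
messy

clean <- df[complete.cases(df), ]
clean[clean$temp > 16, ]    # after cleaning, only real rows are returned
```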
We now have the aq data frame, which has a complete observation for each date and time.
aq[(aq$ozone > .070), ] # This returns all the days with readings above .070 ppm
## date ozone temp solar month weekday
## 134 2013-05-14 0.081 74 1.40 5 3
## 171 2013-06-20 0.074 80 1.35 6 5
## 252 2013-09-09 0.078 83 1.11 9 2
oz.viol <- aq[(aq$ozone > .070), ] # You must assign this to a variable if you want to save that information.
If we wanted all of the days in the 7th month, we could use ==
aq[(aq$month == 7), ]
## date ozone temp solar month weekday
## 207 2013-07-26 0.029 69 0.29 7 6
## 208 2013-07-27 0.021 61 0.89 7 7
## 209 2013-07-28 0.023 60 1.15 7 1
## 210 2013-07-29 0.036 66 1.19 7 2
## 211 2013-07-30 0.025 73 0.92 7 3
## 212 2013-07-31 0.043 67 0.50 7 4
Or if we want all days except the 6th day of the week, use !=
aq[(aq$weekday != 6), ] #Excludes all data associated with the 6th day of the week
Conditions can be combined with the & (and) operator:
aq[(aq$temp >= 80 & aq$temp <= 85), ]
## date ozone temp solar month weekday
## 121 2013-05-01 0.068 80 1.36 5 4
## 140 2013-05-20 0.069 81 1.38 5 2
## 150 2013-05-30 0.038 82 1.09 5 5
## 168 2013-06-17 0.062 84 1.44 6 2
## 171 2013-06-20 0.074 80 1.35 6 5
## 172 2013-06-21 0.048 81 0.60 6 6
## 174 2013-06-23 0.058 82 1.35 6 1
## 178 2013-06-27 0.050 83 1.35 6 5
## 219 2013-08-07 0.056 82 1.05 8 4
## 232 2013-08-20 0.061 82 1.16 8 3
## 233 2013-08-21 0.058 85 1.16 8 4
## 237 2013-08-25 0.060 84 1.20 8 1
## 238 2013-08-26 0.053 84 1.01 8 2
## 250 2013-09-07 0.050 83 1.11 9 7
## 252 2013-09-09 0.078 83 1.11 9 2
## 254 2013-09-11 0.045 81 0.95 9 4
Either of two conditions can be accepted with the | (or) operator:
aq[(aq$weekday == 3 | aq$weekday == 5),]
## date ozone temp solar month weekday
## 1 2013-01-01 0.032 17 0.65 1 3
## 3 2013-01-03 0.021 28 0.17 1 5
## 8 2013-01-08 0.021 30 0.39 1 3
## 10 2013-01-10 0.024 33 0.42 1 5
## 15 2013-01-15 0.017 19 0.66 1 3
## 17 2013-01-17 0.034 33 0.69 1 5
## 22 2013-01-22 0.026 1 0.73 1 3
## 36 2013-02-05 0.026 26 0.21 2 3
## 38 2013-02-07 0.025 33 0.09 2 5
## 43 2013-02-12 0.028 28 0.30 2 3
## 45 2013-02-14 0.033 38 0.28 2 5
## 50 2013-02-19 0.029 26 0.54 2 3
## 52 2013-02-21 0.036 28 0.17 2 5
## 57 2013-02-26 0.035 35 0.29 2 3
## 59 2013-02-28 0.033 34 0.58 2 5
## 64 2013-03-05 0.037 33 0.22 3 3
## 66 2013-03-07 0.039 34 1.03 3 5
## 71 2013-03-12 0.039 34 1.05 3 3
## 73 2013-03-14 0.031 35 0.76 3 5
## 78 2013-03-19 0.042 24 1.19 3 3
## 80 2013-03-21 0.034 26 1.20 3 5
## 85 2013-03-26 0.039 38 1.13 3 3
## 87 2013-03-28 0.044 44 1.23 3 5
## 92 2013-04-02 0.045 37 1.28 4 3
## 94 2013-04-04 0.055 49 1.27 4 5
## 99 2013-04-09 0.035 40 0.83 4 3
## 101 2013-04-11 0.032 40 0.14 4 5
## 106 2013-04-16 0.041 47 1.34 4 3
## 108 2013-04-18 0.035 51 0.16 4 5
## 113 2013-04-23 0.036 58 0.30 4 3
## 115 2013-04-25 0.046 45 0.83 4 5
## 120 2013-04-30 0.059 74 1.37 4 3
## 122 2013-05-02 0.029 45 1.22 5 5
## 127 2013-05-07 0.059 61 1.38 5 3
## 129 2013-05-09 0.045 64 0.61 5 5
## 134 2013-05-14 0.081 74 1.40 5 3
## 136 2013-05-16 0.057 73 1.41 5 5
## 141 2013-05-21 0.055 74 1.27 5 3
## 143 2013-05-23 0.033 51 1.09 5 5
## 148 2013-05-28 0.033 72 0.93 5 3
## 150 2013-05-30 0.038 82 1.09 5 5
## 157 2013-06-06 0.040 55 0.87 6 5
## 162 2013-06-11 0.066 78 1.23 6 3
## 164 2013-06-13 0.040 64 1.46 6 5
## 169 2013-06-18 0.033 57 1.43 6 3
## 171 2013-06-20 0.074 80 1.35 6 5
## 176 2013-06-25 0.041 72 0.98 6 3
## 178 2013-06-27 0.050 83 1.35 6 5
## 211 2013-07-30 0.025 73 0.92 7 3
## 213 2013-08-01 0.041 75 1.32 8 5
## 218 2013-08-06 0.044 73 1.00 8 3
## 220 2013-08-08 0.035 68 1.29 8 5
## 225 2013-08-13 0.033 65 1.34 8 3
## 227 2013-08-15 0.042 72 1.21 8 5
## 232 2013-08-20 0.061 82 1.16 8 3
## 234 2013-08-22 0.040 74 0.41 8 5
## 239 2013-08-27 0.057 89 1.18 8 3
## 241 2013-08-29 0.047 77 1.23 8 5
## 246 2013-09-03 0.035 67 1.23 9 3
## 248 2013-09-05 0.036 69 1.23 9 5
## 253 2013-09-10 0.059 91 1.15 9 3
## 255 2013-09-12 0.034 72 1.18 9 5
## 260 2013-09-17 0.039 64 0.97 9 3
## 262 2013-09-19 0.035 70 0.87 9 5
## 267 2013-09-24 0.043 64 1.14 9 3
## 269 2013-09-26 0.050 66 1.14 9 5
## 274 2013-10-01 0.044 68 0.92 10 3
## 276 2013-10-03 0.022 66 0.46 10 5
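When several values of one column should be accepted, chaining == with | gets long. A possible shorthand, which the lesson itself does not use, is the base R %in% operator. The sketch below compares the two forms on a hypothetical vector of day-of-week codes.

```r
weekday <- c(3, 4, 5, 6, 7, 1, 2, 3)   # hypothetical day-of-week codes

either <- weekday == 3 | weekday == 5  # the form used above
same   <- weekday %in% c(3, 5)         # equivalent test written with %in%

identical(either, same)
which(same)   # positions of the matching elements
```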
Another way to filter rows is the subset() function. It takes the form:
subset(x, logical expression)
high.temp <- subset(aq, temp > 90) # Using subset you can refer to the name of the column without using the $ operator. To save the information you need to assign it a variable name
By using the select = parameter you can specify which columns to keep.
aq.sub <- subset(aq, temp > 90, select = c(ozone, temp))
You can save the subsetted data to a new variable so that you only work with that in the future
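To show how subset() relates to the bracket notation, the sketch below runs the same filter both ways on a hypothetical hand-built data frame and checks that the results agree.

```r
# Hypothetical data frame (values made up for illustration)
df <- data.frame(
  ozone = c(0.059, 0.045, 0.078, 0.050),
  temp  = c(91, 81, 83, 92),
  solar = c(1.15, 0.95, 1.11, 1.20)
)

by_bracket <- df[df$temp > 90, c("ozone", "temp")]
by_subset  <- subset(df, temp > 90, select = c(ozone, temp))

all.equal(by_bracket, by_subset)   # TRUE when the two results match
by_subset
```

One difference to keep in mind: subset() silently drops rows where the condition is NA, while the bracket form returns them as all-NA rows.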
You can sort data using the order function. The default is ascending, but by putting a minus sign in front of the variable you can sort descending as well. Let’s sort the chicago_air dataset by ozone.
sort.oz <- chicago_air[order(chicago_air$ozone),] # sort the dataset by ozone in ascending order
sort.oz.sol <- with(chicago_air, chicago_air[order(ozone, solar),]) # sort the data first by ozone and then by solar radiation
sort.oz.sol2 <- with(chicago_air, chicago_air[order(ozone, -solar),]) # sort the data first by ozone in ascending order, then by solar radiation in descending order
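It helps to see that order() does not return sorted values; it returns the row positions that would sort them, and those positions are then used inside the brackets. The sketch below uses a hypothetical four-row data frame with a tie in ozone so the second sort key has something to do.

```r
# Hypothetical data frame with a tie in ozone (rows 1 and 3)
df <- data.frame(
  ozone = c(0.025, 0.021, 0.025, 0.020),
  solar = c(0.48, 0.17, 0.90, 0.61)
)

order(df$ozone)   # row positions, smallest ozone first

asc   <- df[order(df$ozone), ]                     # ascending
desc  <- df[order(df$ozone, decreasing = TRUE), ]  # descending
mixed <- df[order(df$ozone, -df$solar), ]          # ozone up, ties by solar down

asc$ozone
mixed$solar
```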
Now that you have created several data frames by chopping up the data, it’s a good time to learn about recombining data using cbind() and rbind().
rbind will allow you to combine data frames by rows, but the data frames must have the same columns.
The rbind.fill function from the plyr package gets around this by filling the remaining columns with NAs.
cbind will allow you to combine vectors, matrices or data frames by columns.
combo <- rbind(sort.oz.sol, sort.oz.sol2)
install.packages("plyr")
library(plyr)
rbind.fill(oz.viol, aq.sub)
cbind(aq.sub,oz.viol) # the number of rows in the two datasets should be equal; otherwise the shorter one is recycled, which only works when its row count divides evenly into the longer one's.
## date ozone temp solar month weekday
## 1 2013-05-14 0.081 74 1.40 5 3
## 2 2013-06-20 0.074 80 1.35 6 5
## 3 2013-09-09 0.078 83 1.11 9 2
## 4 <NA> 0.059 91 NA NA NA
## ozone temp date ozone temp solar month weekday
## 1 0.059 91 2013-05-14 0.081 74 1.40 5 3
## 2 0.059 91 2013-06-20 0.074 80 1.35 6 5
## 3 0.059 91 2013-09-09 0.078 83 1.11 9 2
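The rules for combining can be checked on small pieces. This is a minimal base R sketch with hypothetical data frames `a`, `b` and `extra`; it leaves out rbind.fill so that it runs without installing plyr.

```r
a     <- data.frame(ozone = c(0.081, 0.074), temp = c(74, 80))
b     <- data.frame(ozone = 0.059, temp = 91)
extra <- data.frame(solar = c(1.40, 1.35))

stacked <- rbind(a, b)     # same columns: rows are appended
nrow(stacked)

wide <- cbind(a, extra)    # same number of rows: columns are added
names(wide)

# rbind() stops with an error when the columns do not match;
# tryCatch() captures that error so the script keeps running
res <- tryCatch(rbind(a, extra), error = function(e) "error")
res
```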
Most of the data we care about as environmental data analysts has dates and times associated with it. These have to be properly imported into R so we can do analysis and create plots.
Dates in R are handled with the as.Date function from the base package.
If you want to create dates in R, please refer to the help(POSIXlt) and help(POSIXct) help files.
However, today we will talk about converting existing dates to the correct date/time format since that is often the case with environmental data.
If your data were exported from Excel (or other file formats), they may be either in numeric or character format.
To convert them, use the as.Date function with your vector of dates and the format they are currently stored in. The possible date formats are listed in a table below.
For example, a date written as month/day/two-digit year is in the format %m/%d/%y, while “May 27 1984” is in the format %B %d %Y.
To import those dates, you would provide your dates, their format (if you don’t specify a format, as.Date will try %Y-%m-%d and then %Y/%m/%d), and the timezone they are in.
Symbol | Meaning | Example |
---|---|---|
%d | day of the month as a number (01-31) | 01-31 |
%a | abbreviated weekday | Mon |
%A | unabbreviated weekday | Monday |
%m | month as a number (01-12) | 01-12 |
%b | abbreviated month | Jan |
%B | unabbreviated month | January |
%y | 2-digit year | 01 |
%Y | 4-digit year | 2001 |
%j | day of the year as a number (julian day) | 001-366 |
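The format symbols can be tried out directly. The sketch below parses the same calendar day from three differently written strings; the input strings are made up for illustration, and only numeric symbols are used because month and weekday names depend on the computer's language settings.

```r
d1 <- as.Date("05/27/84", format = "%m/%d/%y")    # month/day/2-digit year
d2 <- as.Date("27.05.1984", format = "%d.%m.%Y")  # day.month.4-digit year
d3 <- as.Date("1984-148", format = "%Y-%j")       # year and day of the year

d1
d1 == d2
d3

# A format that does not match the text gives NA rather than an error
bad <- as.Date("05/27/84", format = "%Y-%m-%d")
is.na(bad)
```

Because a wrong format fails quietly with NA, it is worth checking for NAs after every date conversion.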
When we imported our chicago_air sample dataset, our dates came in as characters.
str(chicago_air$date)
## chr [1:365] "2013-01-01" "2013-01-02" "2013-01-03" ...
We need to change them to R dates using the as.Date function below.
chicago_air$date <- as.Date(chicago_air$date, format = '%Y-%m-%d', tz = "America/Chicago")
str(chicago_air$date)
## Date[1:365], format: "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" ...
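Once a column is stored as a Date, R can compare, subtract and reformat the values, which is not possible while they are plain text. A short sketch on three hypothetical dates:

```r
dates <- as.Date(c("2013-01-01", "2013-07-26", "2013-12-31"))

span <- dates[3] - dates[1]                     # difference in days
span

late <- dates[dates >= as.Date("2013-07-01")]   # dates can be compared and filtered
late

months <- as.numeric(format(dates, "%m"))       # month number pulled out of each date
months

format(dates, "%j")                             # day of the year, returned as text
```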
Now let’s look at some more complicated dates and times as they might look in an hourly AQS file.
data(airdata)
head(airdata)
If your dates look like 12312007 and R treats them as numeric, it is best to change these to character format using x <- as.character() before converting them to dates.
str(airdata$datetime)
## chr [1:367595] "20141231T0100-0600" "20141231T0100-0600" ...
site1 <- airdata[airdata$site == 840181270011, ] ##Let's subset this dataframe down to 1 site for ease of use
head(site1)
## site data_status action_code datetime parameter
## 326199 840181270011 0 10 20141231T0100-0600 62101
## 326205 840181270011 0 10 20141230T0100-0600 62101
## 326211 840181270011 0 10 20141229T0100-0600 62101
## 326217 840181270011 0 10 20141228T0100-0600 62101
## 326223 840181270011 0 10 20141227T0100-0600 62101
## 326229 840181270011 0 10 20141226T0100-0600 62101
## duration frequency value unit qc poc lat lon GISDatum
## 326199 60 0 13 15 0 1 41.63411 -87.10148 WGS84
## 326205 60 0 27 15 0 1 41.63411 -87.10148 WGS84
## 326211 60 0 32 15 0 1 41.63411 -87.10148 WGS84
## 326217 60 0 32 15 0 1 41.63411 -87.10148 WGS84
## 326223 60 0 46 15 0 1 41.63411 -87.10148 WGS84
## 326229 60 0 37 15 0 1 41.63411 -87.10148 WGS84
## elev method_code mpc mpc_value uncertainty qualifiers
## 326199 196 41 1 -60 NA <NA>
## 326205 196 41 1 -60 NA <NA>
## 326211 196 41 1 -60 NA <NA>
## 326217 196 41 1 -60 NA <NA>
## 326223 196 41 1 -60 NA <NA>
## 326229 196 41 1 -60 NA <NA>
We use strptime to deal with times or date-time combos.
Before using the strptime function, we need to get our date and time information into a more reasonable looking format than what they came in (e.g. 20141231T0100-0600).
We can use the paste and substr functions to pull out the elements that we need before we can use them with strptime.
datetime <- '20141231T0100-0600' ##This is the ugly version we have now
date <- substr(datetime, start = 1, stop = 8) ## Let's grab the first 8 characters
time <- substr(datetime, start = 10, stop = 13) ## And characters 10 through 13, the hour and minute
new.datetime <- paste(date, time, sep=" ") #Now we will paste them together into a new date-time
new.datetime
## [1] "20141231 0100"
Now, we can feed this new data into the strptime() function. This function requires the following arguments (similar to as.Date): date-time values, date-time format, and the timezone. The strptime function requires the date to be in character format first.
final.datetime <- strptime(new.datetime, format="%Y%m%d %H%M", tz="America/Chicago")
final.datetime # The AQS date-time values are now recognized as a date and time by R and can be used in future time-series analysis and plotting.
## [1] "2014-12-31 01:00:00 CST"
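The same steps work on a whole vector at once, since substr, paste and strptime all operate element by element. The sketch below applies them to two sample stamps written in the AQS style shown above; the second stamp is made up.

```r
raw <- c("20141231T0100-0600", "20141230T1300-0600")  # sample AQS-style stamps

# Characters 1-8 are the date, 10-13 the hour and minute;
# paste() joins them with a space by default
stamp  <- paste(substr(raw, 1, 8), substr(raw, 10, 13))
parsed <- strptime(stamp, format = "%Y%m%d %H%M", tz = "America/Chicago")

class(parsed)                      # strptime() returns the POSIXlt class
format(parsed, "%Y-%m-%d %H:%M")   # readable text version
parsed$hour                        # hour component of each value
```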
as.POSIXlt and as.POSIXct are two other functions that will convert many data types to dates, and they are the preferred method for use with data that will be fed into the openair package. The ‘ct’ in as.POSIXct stands for calendar time, and it stores the number of seconds since the origin. as.POSIXlt stores date-time information as a list of time attributes such as “hour” and “mon”. This might be useful if you want to extract those attributes later. Unless you are very familiar with extracting information from lists, you may want to avoid this format. Luckily, if you are using as.POSIXct date-times in openair, it will automatically pull out these attributes for you.
as.POSIXct requires the same information as strptime, as you can see in the example below.
final.datetime <- as.POSIXct(new.datetime, format="%Y%m%d %H%M", tz="America/Chicago")
final.datetime
## [1] "2014-12-31 01:00:00 CST"
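The difference between the two classes shows up when you do arithmetic or pull out parts of a date-time. A minimal sketch, using two made-up stamps one day apart:

```r
ct <- as.POSIXct(c("20141230 0100", "20141231 0100"),
                 format = "%Y%m%d %H%M", tz = "America/Chicago")

gap <- difftime(ct[2], ct[1], units = "hours")  # time between the two stamps
gap

ct[1] + 3600              # adding seconds moves the clock forward one hour
is.numeric(unclass(ct))   # underneath, POSIXct is a number of seconds

lt <- as.POSIXlt(ct[2])   # the list-style version of the same moment
lt$hour                   # hour of the day
lt$mon + 1                # months are stored as 0-11, so add 1
lt$year + 1900            # years are stored as years since 1900
```

The zero-based month and the year offset are the usual stumbling blocks when reading POSIXlt components by hand.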
Symbol | Meaning | Example |
---|---|---|
%H | hour as a number (24 hour) | 00-23 |
%M | minute as a number | 00-59 |
%S | second as a number | 00-59 |
%I | hour as a number (12 hour) | 01-12 |
Here are the paste and substr functions nested within one statement and combined with the strptime function to use on our site1 dataset created above:
site1$newdatetime <- strptime(paste(substr(site1$datetime, 1, 8), substr(site1$datetime, 10, 13), sep=" "), format="%Y%m%d %H%M", tz="America/Chicago")
head(site1)
## site data_status action_code datetime parameter
## 326199 840181270011 0 10 20141231T0100-0600 62101
## 326205 840181270011 0 10 20141230T0100-0600 62101
## 326211 840181270011 0 10 20141229T0100-0600 62101
## 326217 840181270011 0 10 20141228T0100-0600 62101
## 326223 840181270011 0 10 20141227T0100-0600 62101
## 326229 840181270011 0 10 20141226T0100-0600 62101
## duration frequency value unit qc poc lat lon GISDatum
## 326199 60 0 13 15 0 1 41.63411 -87.10148 WGS84
## 326205 60 0 27 15 0 1 41.63411 -87.10148 WGS84
## 326211 60 0 32 15 0 1 41.63411 -87.10148 WGS84
## 326217 60 0 32 15 0 1 41.63411 -87.10148 WGS84
## 326223 60 0 46 15 0 1 41.63411 -87.10148 WGS84
## 326229 60 0 37 15 0 1 41.63411 -87.10148 WGS84
## elev method_code mpc mpc_value uncertainty qualifiers
## 326199 196 41 1 -60 NA <NA>
## 326205 196 41 1 -60 NA <NA>
## 326211 196 41 1 -60 NA <NA>
## 326217 196 41 1 -60 NA <NA>
## 326223 196 41 1 -60 NA <NA>
## 326229 196 41 1 -60 NA <NA>
## newdatetime
## 326199 2014-12-31 01:00:00
## 326205 2014-12-30 01:00:00
## 326211 2014-12-29 01:00:00
## 326217 2014-12-28 01:00:00
## 326223 2014-12-27 01:00:00
## 326229 2014-12-26 01:00:00
# Putting site1$newdatetime to the left of the arrow is a shortcut for creating a new column named "newdatetime" in the dataset.
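As a variation on the step above, the sketch below builds the new column with as.POSIXct instead of strptime. This is a swapped-in choice, not the lesson's method: R's documentation describes POSIXct as the more convenient class for data frame columns. The data frame `hourly` is a hypothetical stand-in for a few rows of an hourly file.

```r
# Hypothetical stand-in for a few rows of an hourly file
hourly <- data.frame(
  datetime = c("20141231T0100-0600", "20141230T0100-0600", "20141229T0100-0600"),
  value    = c(13, 27, 32),
  stringsAsFactors = FALSE
)

# Build "YYYYMMDD HHMM" strings, then convert them to POSIXct
stamp <- paste(substr(hourly$datetime, 1, 8), substr(hourly$datetime, 10, 13))
hourly$newdatetime <- as.POSIXct(stamp, format = "%Y%m%d %H%M",
                                 tz = "America/Chicago")

hourly <- hourly[order(hourly$newdatetime), ]  # oldest reading first
hourly$value
class(hourly$newdatetime)
```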
If you are importing data from Excel, you may have dates that are in an Excel numeric format. We can still use as.Date to import these; we simply need to know the origin date for Excel’s date numbering system and provide that to the as.Date function.
For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. For Excel on Mac, the origin date is January 1, 1904.
new.date.format <- as.Date(42274.00, origin="1899-12-30",tz="GMT") ##This code should work for any Excel dates generated from a Windows machine.
new.date.format
## [1] "2015-09-27"
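The conversion works the same way on a whole column of Excel numbers. The sketch below uses hypothetical serial numbers, shows the extra as.numeric() step needed when the numbers arrive as text, and works out how far apart the Windows and Mac origins are.

```r
serials <- c(42274, 42275, 42300)   # hypothetical Excel day numbers (Windows)
win_dates <- as.Date(serials, origin = "1899-12-30")
win_dates

# Numbers that arrive as text need as.numeric() first
from_text <- as.Date(as.numeric("42274"), origin = "1899-12-30")
from_text

# Days between the two origins: the same calendar day has a
# smaller serial number under the 1904 (Mac) system
offset <- as.numeric(as.Date("1904-01-01") - as.Date("1899-12-30"))
offset
mac_date <- as.Date(42274 - offset, origin = "1904-01-01")
mac_date
```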
If you would like to use dates in a format other than the standard %Y-%m-%d, you can do that using the format function from the base package. Be sure you save the result with a variable name or the new format won’t be saved. This simply changes the way the dates appear.
date.format2 <- format(new.date.format,"%d %b %Y")
date.format2
## [1] "27 Sep 2015"
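Keep in mind that format() hands back text, so the result can no longer be used for date arithmetic until it is converted again. A small sketch with a numeric-only format (chosen here so the output does not depend on the computer's language settings):

```r
d <- as.Date("2015-09-27")

shown <- format(d, "%d/%m/%Y")   # day/month/4-digit year
shown
class(shown)                     # plain text, no longer a Date
class(d)                         # the original object is unchanged

back <- as.Date(shown, format = "%d/%m/%Y")  # convert back when needed
back == d
```

A common pattern is to keep the Date column for calculations and create the formatted text only for labels and reports.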
Another option for working with dates and times is the lubridate package.
library(lubridate)
arrive <- ymd_hms("2011-06-04 12:00:00", tz = "Pacific/Auckland")
arrive
## [1] "2011-06-04 12:00:00 NZST"
Now let’s try some exercises to test our understanding of subsetting and sorting data and working with dates in R.