Before we begin the tutorial, make sure to set your working directory and load the following packages: dplyr, tidyr, ggplot2, and readr.
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
Times and date-times are represented by the POSIXct or POSIXlt class in R. The POSIXct format stores a date-time as the number of seconds since January 1, 1970, so a POSIXct date-time is essentially a single value on a timeline. Date-times prior to 1970 are stored as negative numbers. The POSIXlt class stores date and time information in a list of components such as hour, day of the week, and month of the year. The starting year for POSIXlt data is 1900, so 2022 is stored as year 122. Months also begin at 0, so January is stored as month 0 and February as month 1. For both POSIX classes, the time zone can be specified. While date-times stored as POSIXct and POSIXlt print similarly, unclassing them with the unclass() function reveals the additional information stored within the POSIXlt data.
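As a quick illustration (the date-time below is arbitrary), creating the same moment in both classes and unclassing each one shows the difference:
#the same moment stored both ways
ct = as.POSIXct("2022-03-15 08:30:00", tz="UTC")
lt = as.POSIXlt("2022-03-15 08:30:00", tz="UTC")
#POSIXct unclasses to a single number of seconds since January 1, 1970
unclass(ct)
#POSIXlt unclasses to a list of components: year is 122 (2022 - 1900) and mon is 2 (March)
unclass(lt)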
Dates without times can simply be stored as the Date class in R using the as.Date() function. Both the Date and POSIX classes need to be defined based on how the dates are formatted. When importing time series data into R, date and date-time columns are typically read in as a character class and must be converted to a date or date-time class using the as.Date(), as.POSIXct(), or as.POSIXlt() functions. Below are the codes for how your dates may be commonly formatted upon import:
| Code | Description | Example |
|---|---|---|
| %d | Day of month (01-31) | 23 |
| %m | Month (01-12) | 11 |
| %b or %h | Month Abbreviated | Nov |
| %B | Full Month | November |
| %j | Day of Year (001-366) | 123 |
| %w | Weekday (0-6) | 0 (for Sunday) |
| %y | Year w/out Century | 99 |
| %Y | Year w/ Century | 1999 |
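For example, character dates can be converted by matching a format string to how the text is written (the example dates below are made up for illustration, and %b assumes an English locale for month names):
#convert character dates and date-times using the codes above
as.Date("23/11/2020", format="%d/%m/%Y")
as.Date("23-Nov-99", format="%d-%b-%y")
as.POSIXct("11/23/2020 14:00", format="%m/%d/%Y %H:%M")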
One of the easiest ways to import data is with the read.csv() or read.delim() function for .csv or .txt files, respectively. The readxl package is also useful for reading Excel spreadsheets, including workbooks with multiple sheets, and readr provides fast alternatives such as read_csv(). Now, let’s import the example data provided. We will be looking at subdaily and daily stream temperature and flow data provided in your example data folder. All of the code below indicates which files are used for which example. You should be able to copy and paste the code into your R script and produce the same output shown in this document. Some plots are included just for reference, so their code is not shown.
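As a sketch (the workbook and sheet names here are hypothetical), readxl can read a single sheet or every sheet at once:
library(readxl)
#read one sheet from a hypothetical workbook
sheet1 = read_excel("example_data.xlsx", sheet = "Sheet1")
#read every sheet in the workbook into a list of dataframes
all_sheets = lapply(excel_sheets("example_data.xlsx"), read_excel, path = "example_data.xlsx")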
Once your data is uploaded, the str() function is useful for checking the format of your dataframe. You may also want to check for extraneous NAs or columns.
#upload the example temperature data and format the date-time
example = read.csv("THL_example.csv", header = T)
#check the structure of the data and see that the DateTime column is stored as a character
str(example)
## 'data.frame': 38867 obs. of 4 variables:
## $ StationAbb: chr "THL" "THL" "THL" "THL" ...
## $ SerialNo : int 20706175 20706175 20706175 20706175 20706175 20706175 20706175 20706175 20706175 20706175 ...
## $ DateTime : chr "11/23/2020 14:00" "11/23/2020 14:05" "11/23/2020 14:10" "11/23/2020 14:15" ...
## $ Temp_C : num 10.5 10.5 10.5 10.5 10.5 ...
#change the format of the DateTime column to be a POSIXct class
example = mutate(example, DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %H:%M"))
#double check that your formatting worked
str(example)
## 'data.frame': 38867 obs. of 4 variables:
## $ StationAbb: chr "THL" "THL" "THL" "THL" ...
## $ SerialNo : int 20706175 20706175 20706175 20706175 20706175 20706175 20706175 20706175 20706175 20706175 ...
## $ DateTime : POSIXct, format: "2020-11-23 14:00:00" "2020-11-23 14:05:00" ...
## $ Temp_C : num 10.5 10.5 10.5 10.5 10.5 ...
Oftentimes loggers, as well as data management software that calculates daily values, will simply skip over missing dates rather than filling those dates with blank data. You may want to know where data is missing for a number of reasons, so filling in missing dates with NAs can be very useful. Once the dates are filled in, you can search for where data is missing or fill in those gaps with calculated values if appropriate. We will be doing this using the complete() function from the tidyverse. The complete() function works best when your dates are arranged in order. The example data used is the SPU_Q_2020.csv file. With a simple plot, you can see that there is missing data, and without those dates filled in the plot looks rather unsightly.
Since we know there is missing data, let’s find it and fix it. If you have multiple sites in the dataframe, this process will need to be completed in a loop or with an apply() function, which we will go over later.
#upload the example data
example2 = read.csv("SPU_Q_2020.csv", header = T)
#check to see if there are any NA's in the data
sum(is.na(example2$Q_cfs))
## [1] 0
example2 <- example2 %>%
mutate(Date = as.Date(Date, format="%m/%d/%Y")) %>% #format the date column to be a Date class
mutate(Q_cms = Q_cfs*0.028316847) %>% #create a column that stores flow in cms
arrange(Date) #arrange the dataframe by date
#because we are using daily data, we fill in the missing data by day
#and create a new dataframe to store the complete data
#add the column names for the data that you want to maintain in the data frame
#at the end of the command i.e. StationAbb and StationName
example2_complete <- complete(example2, Date = seq(Date[1], Date[length(Date)], by="day"),
                              StationAbb, StationName)
#now when we check again, we can see that 40 missing days have been filled in with NAs
sum(is.na(example2_complete$Q_cfs))
## [1] 40
#you can check the specific rows that have NA values
which(is.na(example2_complete$Q_cfs))
## [1] 60 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## [20] 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 159 160 161
## [39] 162 163
#extract rows with missing flow data
example2_complete[rowSums(is.na(example2_complete)) > 0,]
## # A tibble: 40 × 5
## Date StationAbb StationName Q_cfs Q_cms
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-02-29 SPU Spring Creek - Upper NA NA
## 2 2020-04-01 SPU Spring Creek - Upper NA NA
## 3 2020-04-02 SPU Spring Creek - Upper NA NA
## 4 2020-04-03 SPU Spring Creek - Upper NA NA
## 5 2020-04-04 SPU Spring Creek - Upper NA NA
## 6 2020-04-05 SPU Spring Creek - Upper NA NA
## 7 2020-04-06 SPU Spring Creek - Upper NA NA
## 8 2020-04-07 SPU Spring Creek - Upper NA NA
## 9 2020-04-08 SPU Spring Creek - Upper NA NA
## 10 2020-04-09 SPU Spring Creek - Upper NA NA
## # … with 30 more rows
Now if we plot the data, it looks much cleaner and we can clearly see where the data is missing.
Bad data can happen for a number of reasons: your logger may have been exposed to ice or air, or the concentration you are measuring may be below the detection limit and, depending on which lab or logging device you are using, reported as an impossible negative value. These data need to be filtered out of your dataset before performing any kind of analysis. How you filter will depend on the type of data you are working with and how the bad data manifest. This example walks through finding negative values and removing them, which can be done using the replace() and which() functions. A logical statement within the which() function identifies which data to replace; for this example, the condition is Temp_C < 0.
#upload the example data
example3 = read.csv("BUU_baddata.csv", header = T)
#set the date-time format and make sure data is arranged by date-time
example3 = example3 %>%
  mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %H:%M")) %>%
arrange(DateTime)
#replace negative values with NA
example3_replace = example3 %>% mutate(Temp_C = replace(Temp_C, which(Temp_C<0), NA))
And now when the data is plotted, the negative values are removed. Just like with missing dates, you can check exactly which rows or date-times were replaced by checking for NAs or subsetting your data.
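For example, you can count the NA values after the replacement and pull out the original rows that held negative temperatures:
#count NA values after the replacement
sum(is.na(example3_replace$Temp_C))
#view the original rows that contained negative values
example3[which(example3$Temp_C < 0), ]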
Dataframes can be subset based on any number of specified parameters. All that is needed is a logical argument within the subset() function.
#subset for date-times in September using specific dates
correct_date <- subset(example3_replace, DateTime >= "2020-09-01" & DateTime < "2020-10-01")
#subset for date-times in September using months()
correct_date2 <- subset(example3_replace, months(DateTime) == "September")
Similar component functions exist, such as weekdays() and quarters(), and format() can extract any part of a date-time (year, day, hour, etc.).
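For instance, format() can pull out any component using the codes in the table above, and weekdays() works just like months() (the day names below assume an English locale):
#subset for September by extracting the month number with format()
correct_date3 <- subset(example3_replace, format(DateTime, "%m") == "09")
#subset for weekend date-times with weekdays()
weekend_data <- subset(example3_replace, weekdays(DateTime) %in% c("Saturday", "Sunday"))
The plotting code below returns to the daily flow data (example2_complete) created earlier.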
ggplot(example2_complete, mapping = aes(x=Date, y=Q_cms)) +
  geom_line(size=0.75, color = "dodgerblue") +
  theme_bw() +
  labs(x="Date", y="Discharge (cms)") +
  theme(legend.title = element_blank(),
        plot.title = element_text(size=11, face="bold", hjust = 0),
        axis.title = element_text(size=11),
        axis.text = element_text(size=10))
scale_x_date() can be used to determine the date limits, breaks and formatting of your x-axis.
#only show the abbreviated month as the axis label and set the breaks
ggplot(example2_complete, mapping = aes(x=Date, y=Q_cms)) +
  geom_line(size=0.75, color = "dodgerblue") +
  theme_bw() +
  labs(x="Date", y="Discharge (cms)") +
  theme(legend.title = element_blank(),
        plot.title = element_text(size=11, face="bold", hjust = 0),
        axis.title = element_text(size=11),
        axis.text = element_text(size=10)) +
  scale_x_date(breaks=as.Date(c("2020-01-01","2020-04-01","2020-07-01","2020-10-01","2020-12-31")),
               date_labels = "%b",
               limits = as.Date(c("2020-01-01","2020-12-31")))
Loops are extremely helpful when you want to run a process multiple times. There are three types of loops in R: for, while, and repeat. The for loop lets you control the number of iterations based on a set of controls that you define, such as a list or a sequence. Below are simple examples of the three loop types that produce the same output.
for (value in 1:5) {
print(value)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
while loops repeat the commands inside the loop for as long as a certain condition is met.
value = 1
while (value <=5) {
print(value)
value = value +1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
repeat loops simply repeat the commands inside them until a specified condition is met, at which point you must terminate the loop with a break statement.
value = 1
repeat {
print(value)
value = value + 1
#check for condition to break the loop
if (value > 5) {
break
}
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
When analyzing data from multiple monitoring sites, one may want to perform the same analyses or make the same figures for all of those sites. The following code will produce the same hydrograph for a number of sites and then save those plots in your working directory folder. You can create a list of monitoring sites in your data using the unique() function. Then you can run a loop that subsets your data based on that list.
qdata = read.csv("QData2020.csv", header = T)
station_list = unique(qdata$StationAbb)
qdata$Date = as.Date(qdata$Date, format = "%m/%d/%Y")
for (i in seq_along(station_list)) {
plot= ggplot(subset(qdata,qdata$StationAbb==station_list[i]),mapping = aes(x=Date, y=Q_cfs, color=TypeofQ)) +
geom_line(size=0.75) +
theme_bw() +
scale_color_manual(labels =c("2020", "Median"), values=c("deepskyblue3", "burlywood4")) + #manually set colors and legend labels
labs(x="Date",y="Discharge (cfs)") +
scale_x_date(breaks=as.Date(c("2020-01-01","2020-04-01","2020-07-01","2020-10-01","2020-12-31")),date_labels = "%b",
limits = as.Date(c('2020-01-01','2020-12-31'))) + #set date breaks, limits and labels
    theme(legend.title = element_blank(),
          plot.title = element_text(size=11, face="bold", hjust = 0),
          legend.background = element_rect(fill = "transparent"),
          legend.key = element_rect(colour = "transparent", fill = "transparent", linetype = NULL),
          legend.text = element_text(size=9),
          legend.position = c(0.75, 0.9),
          axis.title = element_text(size=11),
          axis.text = element_text(size=10),
          axis.ticks = element_blank(),
          panel.border = element_blank()) +
    ggtitle(paste(station_list[i]))
#save your plot with your station ID and set the size of the image saved
ggsave(plot, file=paste(station_list[i], "hydrograph.png", sep='_'), width=3.6, height = 2.2, units=c("in"))
}
Check your working directory to see all of the images saved there. ggplot2 also has the facet_wrap() function that allows you to make a grid of plots.
ggplot(qdata, mapping = aes(x=Date, y=Q_cfs, color=TypeofQ)) +
geom_line(size=0.75) +
facet_wrap(~StationAbb, scales = "free", ncol = 3)+ #indicate with ~ what column to wrap by
theme(strip.background = element_blank()) +
theme(strip.text = element_text(size=10)) #sets the size of the label above each graph
apply() functions can be used in similar ways to loops. lapply() in particular is very similar to the loops above: it applies a function to each element of a list you provide. Above we used station IDs as a list. You can also read in multiple data files using a list of files in a specified folder. Below is an example of using lapply() with the read_csv() and bind_rows() functions to read and bind all of the files in the example folder. The result can then be formatted by date and analyzed.
#upload all of the .csv files in your folder
#you will need to specify the folder yourself
setwd("C:/Users/Lexie/OneDrive - The Pennsylvania State University/WEB Lab Files/Courses/FOR 602/FOR 597/Temperature_Example")
files <- dir(pattern = "\\.csv$")
example4 <- files %>%
lapply(read_csv) %>%
bind_rows
Double check your example4 dataframe to make sure it includes all of the data.
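A couple of quick checks, for example comparing the number of files read against the structure of the combined dataframe, will confirm everything was bound together:
#number of files that were read in
length(files)
#structure and total number of rows of the combined dataframe
str(example4)
nrow(example4)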
Now all of the files have been combined into one dataframe. Let’s convert the date-times to POSIXct, then calculate daily averages and plot them. We can do this in base R with a loop or with dplyr.
#format date-time and create a column for day
example4 = example4 %>%
drop_na(DateTime) %>% #remove any missing dates
mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %H:%M")) %>%
  mutate(Day = as.Date(DateTime))
#create a list based on days in the data file
daylist = unique(example4$Day)
#create an empty dataframe to store daily average temperature
df = data.frame()
k=1
#calculate daily average and standard error for each day and store in the empty dataframe
for (i in seq_along(daylist)) {
#calculate mean avg T
example4_2 <- subset(example4, example4$Day==daylist[i])
example4_2 = example4_2 %>%
mutate(Average_T = mean(Temp_C, na.rm=T)) %>%
mutate(SE = sd(Temp_C, na.rm=T)/sqrt(length(Temp_C)))
#add mean and SE for each day to new data frame
df[k,1] = paste(daylist[i])
df[k,2] = mean(example4_2$Average_T)
df[k,3] = mean(example4_2$SE)
k = k+1
}
#name the columns of your dataframe appropriately
names(df) = c("Day", "Avg_T", "SE")
df$Day = as.Date(df$Day, format="%Y-%m-%d")
#this can also be done quickly by grouping the data by day
example4_average = example4 %>%
mutate(DateTime = as.POSIXct(DateTime, format="%m/%d/%Y %H:%M")) %>%
  mutate(Day = as.Date(DateTime)) %>%
  group_by(Day) %>%
  summarise(Avg_T = mean(Temp_C, na.rm=T))
Finally, let’s plot this data and make sure both methods resulted in the same data!
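One way to do that (a minimal sketch using the objects created above) is to join the two daily summaries by day and overlay them in one plot:
#join the two summaries by day and confirm that the averages agree
check = left_join(example4_average, df, by = "Day", suffix = c("_dplyr", "_loop"))
all.equal(check$Avg_T_dplyr, check$Avg_T_loop)
#overlay the loop results (open points) on the dplyr results (line)
ggplot() +
  geom_line(data = example4_average, aes(x = Day, y = Avg_T), color = "dodgerblue") +
  geom_point(data = df, aes(x = Day, y = Avg_T), shape = 1) +
  theme_bw() +
  labs(x = "Date", y = "Mean daily temperature (C)")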
This method can be used to calculate yearly, monthly, and seasonal averages as well, and it could easily be repeated for multiple sites at a time with a loop (or, as sketched below, with a grouped pipeline).
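For example, monthly averages for every site could be calculated in one grouped pipeline instead of a loop (this sketch assumes the combined data includes a StationAbb column identifying each site):
#monthly average temperature for each site (StationAbb column assumed)
example4_monthly = example4 %>%
  mutate(Month = format(Day, "%m")) %>%
  group_by(StationAbb, Month) %>%
  summarise(Avg_T = mean(Temp_C, na.rm=T))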
Assignment to follow…