Adapted from Mark Niemann-Ross’s LinkedIn tutorial. All datasets for this tutorial may be found in the class datasets link: http://bit.ly/data110datasets
Date information can be challenging to work with because the way it is recorded varies widely from one dataset to another. The goal is to understand how to ensure R recognizes the format you are using as an actual date.
Load the libraries and set working directory
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date
dbl (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): YearMonth
dbl (1): recording
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rr_YMD <- read_csv("year_month_day_sample.csv")
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): YearMonth
dbl (1): recording
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View “dates” dataset
head(dates)
# A tibble: 6 × 2
Date Value
<chr> <dbl>
1 2020-01 25
2 2020-02 35
3 2020-03 18
4 2020-04 27
5 2020-05 22
6 2020-06 36
Find the class for the Date variable
class(dates$Date)
[1] "character"
Notice that the variable “Date” is a character string. Change it so that R recognizes it as a date.
Use lubridate to make “Date” a recognized date
dates$fixdate <- lubridate::ym(dates$Date)
Check that the class of the new variable “fixdate” is correct
Now you can see that the dates are recognized and recorded properly
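The check described above can be sketched on a small stand-in dataset. The data frame sample_dates below is hypothetical (it only mirrors the first rows of “dates” shown earlier), so treat this as an illustration rather than the tutorial's own chunk.

```r
library(lubridate)

# Hypothetical stand-in for the "dates" tibble: year-month strings plus a value
sample_dates <- data.frame(
  Date  = c("2020-01", "2020-02", "2020-03"),
  Value = c(25, 35, 18)
)

# ym() parses year-month strings; the day defaults to the first of the month
sample_dates$fixdate <- ym(sample_dates$Date)

class(sample_dates$Date)     # still a character vector
class(sample_dates$fixdate)  # now a Date
sample_dates$fixdate
```

Because a year-month string carries no day, every converted value lands on the first day of its month.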
R has several ways to store Dates and Times
Date - stores dates only, not times (internally, a date is the number of days since 1970-01-01, the epoch)
POSIXct - Portable Operating System Interface, calendar time - supports dates AND times (and a time zone), stored as the number of seconds since the epoch in vector format; this tends to be easier than POSIXlt
POSIXlt - similar to POSIXct, but values are stored in list format (year, month, day, hour, and so on); to see all values, use unclass
As long as you are not working with times, use the Date class.
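To see how the three classes differ internally, here is a minimal sketch using only base R. The example date and the UTC time zone are chosen just for illustration.

```r
# A Date is stored as the number of days since 1970-01-01
d <- as.Date("2020-01-01")
unclass(d)

# A POSIXct is stored as the number of seconds since 1970-01-01 00:00:00 UTC
ct <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
unclass(ct)

# A POSIXlt is a list of components; years count from 1900 and months from 0
lt <- as.POSIXlt(ct)
names(unclass(lt))
lt$year + 1900
lt$mon + 1
```

The single-number storage is what makes Date and POSIXct convenient as data frame columns, while POSIXlt is handy when you need to pull out individual components.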
Unexpected formats for dates make things tricky
January 7, 2020
1/7/20
Tuesday, Jan 7, 2020
Or something else
# Here is an example of a string that doesn't work - remove the hashtag to run it
# as.Date("January 7, 2020") # this code returns an error we will fix in the next chunk
Notice the error: Error in charToDate(x) : character string is not in a standard unambiguous format
# Include the format
as.Date("January 7, 2020", format = "%B %d, %Y")
[1] "2020-01-07"
# %B indicates a full month name, %d is the day, and %Y is the 4-digit year (%y gives the 2-digit year)
Now you can see that R understands this date
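The other formats listed earlier can be handled the same way. The sketch below assumes an English locale for the weekday and month names; in another locale the second conversion may return NA.

```r
# Numeric month/day/2-digit year: %m, %d, %y
short_date <- as.Date("1/7/20", format = "%m/%d/%y")
short_date

# Weekday name, abbreviated month name, day, 4-digit year: %A, %b, %d, %Y
# (name matching depends on the locale; English is assumed here)
long_date <- as.Date("Tuesday, Jan 7, 2020", format = "%A, %b %d, %Y")
long_date
```

The format string has to mirror the input exactly, including commas, slashes, and spaces.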
# to get a complete list of time formats - remove the hashtag to run it
# ?strptime
Convert a specific date
# On Friday, March 13, 2020, most of the United States went into lockdown due to the COVID-19 pandemic
pandemic <- "Friday, March 13, 2020"
# convert to Date class
as.Date(pandemic, format = "%A, %B %d, %Y")
[1] "2020-03-13"
Date/Time conversions are not always clear or easy, but as long as you have the tools, it will make much more sense
Comparing and Manipulating Dates
# Compare the epoch to today's date
epoch <- "1970-01-01"
today <- Sys.Date()
# Try logical statements:
today > epoch
[1] TRUE
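Beyond comparisons, dates also support arithmetic. Here is a small sketch with two dates picked only for illustration:

```r
start_date <- as.Date("2020-01-01")
end_date   <- as.Date("2020-01-31")

# Subtracting two dates gives a difftime, measured in days
days_between <- end_date - start_date
days_between
as.numeric(days_between)

# Adding a number to a date moves it forward by that many days
one_week_later <- start_date + 7
one_week_later

# Comparisons work the same way as with the epoch example
end_date > start_date
```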
Create a sequence of dates
DaysofJan <- seq.Date(from = as.Date("2020/01/01"), to = as.Date("2020/01/31"), by = "day")
DaysofJan
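The by argument accepts other steps as well. As a rough sketch (the variable names here are made up for the example):

```r
# One date per week across January 2020
weeks_of_jan <- seq(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "week")
weeks_of_jan

# The first day of six consecutive months, using length.out instead of an end date
first_of_months <- seq(as.Date("2020-01-01"), by = "month", length.out = 6)
first_of_months
```

Use to when you know where the sequence should stop and length.out when you know how many dates you need.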
am(PabloPicassoBday) # Was he born in the morning?
[1] FALSE
leap_year(PabloPicassoBday) # Was he born on a leap year?
[1] FALSE
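These lubridate helpers work on any date-time. The sketch below uses a hypothetical timestamp (example_time, an afternoon on March 13, 2020), not the birthday value used above.

```r
library(lubridate)

# Hypothetical date-time chosen for illustration only
example_time <- ymd_hms("2020-03-13 15:30:00", tz = "UTC")

am(example_time)         # is the time before noon?
pm(example_time)         # is the time noon or later?
leap_year(example_time)  # does the date fall in a leap year?
wday(example_time)       # day of the week as a number (Sunday = 1 by default)
year(example_time)
month(example_time)
```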
Rounding and Truncating dates and times with lubridate
# from base R
Sys.time()
[1] "2023-09-08 15:46:31 EDT"
round(Sys.time(), "hour")
[1] "2023-09-08 16:00:00 EDT"
round(Sys.time(), "year")
[1] "2024-01-01 EST"
trunc(Sys.time(), "year")
[1] "2023-01-01 EST"
Lubridate provides similar functions that are easier to understand
now() # equivalent to Sys.time()
[1] "2023-09-08 15:46:31 EDT"
floor_date(now(), unit = "month") # round down
[1] "2023-09-01 EDT"
floor_date(now(), unit = "year")
[1] "2023-01-01 EST"
round_date(now(), unit = "hour") # round to nearest unit
[1] "2023-09-08 16:00:00 EDT"
ceiling_date(now(), unit = "minutes") # round up
[1] "2023-09-08 15:47:00 EDT"
# last day of previous month
rollback(now(), roll_to_first = FALSE, preserve_hms = FALSE)
[1] "2023-08-31 EDT"
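Because now() changes every time it runs, the rounding functions are easier to check against a fixed timestamp. This sketch reuses the time shown in the output above, with the UTC time zone assumed to keep the result reproducible.

```r
library(lubridate)

# A fixed timestamp so the results are the same on every run (UTC assumed)
fixed_time <- ymd_hms("2023-09-08 15:46:31", tz = "UTC")

month_start <- floor_date(fixed_time, unit = "month")     # round down
nearest_hr  <- round_date(fixed_time, unit = "hour")      # round to nearest
next_minute <- ceiling_date(fixed_time, unit = "minute")  # round up

month_start
nearest_hr
next_minute
```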
Time Series Data
Time series data may show patterns; data arrives in chronological order
Time Series Classes in R
ts - time series
zoo - Zeileis’s ordered observations (for indexed, totally ordered observations, including irregular time series; stock market data is an example)
Time series analysis is a complex topic that involves heavy statistics and matrix manipulations
Class TS for time series
# data is a matrix
# creating sample data. Four sine waves
mydata <- matrix(c(sin(seq(0, 10, 0.1)),
                   sin(seq(1, 11, 0.1)),
                   sin(seq(2, 12, 0.1)),
                   sin(seq(3, 13, 0.1))),
                 ncol = 4)
# columns are series of observations and frequency / deltat are rows
head(mydata)
Series 1 Series 2 Series 3 Series 4
[1,] 0.00000000 0.8414710 0.9092974 0.14112001
[2,] 0.09983342 0.8912074 0.8632094 0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
# instead of frequency, delta between observations
# use frequency OR delta
deltaObserve <- 1/12
ts1 <- ts(mydata, start = 2019, deltat = deltaObserve)
head(ts1)
Series 1 Series 2 Series 3 Series 4
[1,] 0.00000000 0.8414710 0.9092974 0.14112001
[2,] 0.09983342 0.8912074 0.8632094 0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
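# name the series
pumpNames <- c("East", "West", "North", "South")
# howOftenObserved is not defined in this section; 12 (monthly) is assumed here,
# which is equivalent to deltat = 1/12
howOftenObserved <- 12
ts2 <- ts(mydata, end = 2019, frequency = howOftenObserved, names = pumpNames)
head(ts2)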
East West North South
[1,] 0.00000000 0.8414710 0.9092974 0.14112001
[2,] 0.09983342 0.8912074 0.8632094 0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
# adding time series
endSeries <- ts(mydata, end = 2019, frequency = howOftenObserved, names = pumpNames)
# Now the column names are the pump names: East, West, North, and South
head(endSeries)
East West North South
[1,] 0.00000000 0.8414710 0.9092974 0.14112001
[2,] 0.09983342 0.8912074 0.8632094 0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
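Once a ts object exists, base R can report its time attributes and extract a slice by time. Here is a minimal sketch using a made-up monthly series (the values 1 to 24 are placeholders, not pump data):

```r
# Hypothetical monthly series: 24 observations starting January 2019
monthly <- ts(1:24, start = c(2019, 1), frequency = 12)

frequency(monthly)  # observations per year
start(monthly)      # year and period of the first observation
end(monthly)        # year and period of the last observation

# window() extracts a sub-series by time rather than by row number
summer_2019 <- window(monthly, start = c(2019, 6), end = c(2019, 8))
summer_2019
```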
Use the “zoo” package
zoo is an R package for time series analysis; its name stands for “Z’s ordered observations”
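As a small illustration of what “ordered observations” means, the sketch below builds a zoo object from irregularly spaced dates supplied out of order. The dates and values are invented for the example.

```r
library(zoo)

# Hypothetical irregular observations, deliberately given out of order
obs_dates <- as.Date(c("2020-01-02", "2020-03-15", "2020-01-20"))
prices <- zoo(c(10, 30, 20), order.by = obs_dates)

prices            # zoo sorts the observations by their index
index(prices)     # the dates
coredata(prices)  # the values, in date order
```

Unlike ts, which assumes evenly spaced observations, zoo keeps the actual date attached to each value.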
Read in weather data from a table in a website
library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
‘tibbletime’ is an extension that allows for the creation of time aware tibbles.
library(tibbletime)
Attaching package: 'tibbletime'
The following object is masked from 'package:stats':
filter
Create a time series plot using ggplot
p1 <- read_table("https://raw.githubusercontent.com/lyndadotcom/LPO_weatherdata/master/Environmental_Data_Deep_Moor_2015.txt") |>
  unite(datetime, date, time) |>
  mutate(datetime = ymd_hms(datetime)) |>
  tbl_time(index = datetime) |> # tibbletime is built to live in the tidyverse
  ggplot(aes(x = datetime, y = Air_Temp)) +
  geom_line(color = "#00AFBB", linewidth = .25) +
  ggtitle("Time Series Plot of Air Temperatures in 2015")