An effective initial step for characterizing the nature of a time series and for detecting potential problems is to use data visualization. By visualizing the series we can detect initial patterns, identify its components and spot potential problems such as extreme values, unequal spacing, and missing values.
The most basic and informative plot for visualizing a time series is the time plot. In its simplest form, a time plot is a line chart of the series values (\(y_1, y_2, \ldots\)) over time (\(t = 1, 2, \ldots\)), with temporal labels (e.g., calendar date) on the horizontal axis.
To illustrate this, consider the example of ridership on Amtrak trains in the USA. A time plot for the monthly Amtrak ridership series is shown below.
The data on Amtrak ridership are available at www.forecastingbook.com/mooc.
For more examples and further details on visualizing time series, see Chapter 3 in the textbook Practical Time Series Forecasting.
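library(dplyr)    # provides %>%, mutate(), select() and arrange()
library(ggplot2)  # provides labs(), used with the feasts plots below
# Read the data and convert the Month column to a yearmonth index
Amtrak.data <- read.csv("Amtrak data.csv") %>% mutate(MonthYear = tsibble::yearmonth(format(as.Date(Month,"%d/%m/%Y"))))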
print(paste0("min date: ", min(Amtrak.data$MonthYear), " ||| max date:", max(Amtrak.data$MonthYear)))
## [1] "min date: 1991 Jan ||| max date:2004 Mar"
print(paste0("min ridership: ", min(Amtrak.data$Ridership), " ||| max ridership: ", max(Amtrak.data$Ridership)))
## [1] "min ridership: 1360.852 ||| max ridership: 2223.349"
ridership.ts <- ts(Amtrak.data$Ridership, start = c(1991,1), end = c(2004,3), frequency = 12)
plot(ridership.ts, xlab = "Time", ylab = "Ridership", ylim = c(1300,2300), bty = "l")
The function ts creates a time series object out of the Ridership column of the data frame (Amtrak.data$Ridership). We give the time series the name ridership.ts. This time series starts in January 1991, ends in March 2004, and has a frequency of 12 observations per year. By defining its frequency as 12, we can later use other functions to examine its seasonal pattern. The call to plot then produces the actual plot of the time series. R gives us control over the labels, axis limits, and the plot’s border type.
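To make the arguments of ts concrete without needing the CSV file, here is a minimal sketch on a short made-up vector. The values in toy.values are invented for illustration and are not Amtrak data; the sketch only shows how start and frequency determine the time index of the resulting object.

```r
# Hypothetical values, invented for illustration only (not Amtrak data)
toy.values <- c(10, 12, 15, 14, 18, 21, 25, 24, 19, 16, 13, 11,
                11, 13, 16)

# 15 monthly observations starting in January 1991
toy.ts <- ts(toy.values, start = c(1991, 1), frequency = 12)

start(toy.ts)      # first period: year 1991, month 1
end(toy.ts)        # last period: year 1992, month 3
frequency(toy.ts)  # 12 observations per year
cycle(toy.ts)      # position within the year (1 to 12) of each observation
```

Because the end of the series follows from its length, start and frequency are enough here; passing end as well, as in the Amtrak code, additionally truncates or recycles the data to fit that range.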
Note that the values are in thousands of riders. Looking at the time plot reveals the nature of the series components: the overall level is around 1,800,000 passengers per month. A slight U-shaped trend is discernible during this period, with pronounced annual seasonality; peak travel occurs during the summer months of July and August.
# Build a tsibble indexed by month (no trailing comma after the last argument)
ridership.tsibble <- tsibble::as_tsibble(
Amtrak.data %>% select(MonthYear,Ridership),
index = MonthYear
) %>% arrange(MonthYear)
feasts::autoplot(ridership.tsibble)
# Zoom in on the first year of the series (1991)
plot(ts(Amtrak.data$Ridership, start = c(1991,1), end = c(1991,12), frequency = 12),
xlab = "Time", ylab = "Ridership", ylim = c(1300,2300), bty = "l")
ridership.tsibble %>%
feasts::gg_season(y = Ridership, period = "year")+
labs(title = "Seasonal plots", x = "Month")
A second step in visualizing a time series is to examine it more carefully. The following operations are useful:
Zooming in: Zooming in to a shorter period within the series can reveal patterns that are hidden when viewing the entire series. This is especially important when the time series is long.
Changing the Scale: To better identify the shape of a trend, it is useful to change the scale of the series. One simple option is to change the vertical scale (of y) to a logarithmic scale. If the trend on the new scale appears more linear, then the trend in the original series is closer to an exponential trend.
Adding Trend Lines: Another possibility for better capturing the shape of the trend is to add a trend line. By trying different trend lines one can see what type of trend (e.g., linear, exponential, cubic) best approximates the data.
Suppressing Seasonality: It is often easier to see trends in the data when seasonality is suppressed. Suppressing seasonal patterns can be done by plotting the series at a cruder time scale (e.g., aggregating monthly data into years). A second option is to plot separate time plots for each season. A third, popular option is to use moving average plots. We discuss moving average plots in Week 4.
# Create subseries plot
ridership.tsibble %>%
feasts::gg_subseries(Ridership
, period = "year" # this can be omitted
)+
labs(title = "Subseries plot by Month", x = "Year")
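The four operations described above (zooming in, changing the scale, adding trend lines, and suppressing seasonality) can also be sketched with base R alone. The series below is synthetic and invented for illustration, not the Amtrak data, and the choice of a linear and a quadratic trend line is just one possible pair of candidates.

```r
# Synthetic monthly series, invented for illustration (not the Amtrak data):
# exponential-like trend multiplied by a yearly seasonal pattern
n <- 120
t.index <- 1:n
toy.values <- 100 * exp(0.01 * t.index) * (1 + 0.1 * sin(2 * pi * t.index / 12))
toy.ts <- ts(toy.values, start = c(1995, 1), frequency = 12)

# Zooming in: keep only 1997 and 1998
toy.zoom <- window(toy.ts, start = c(1997, 1), end = c(1998, 12))
plot(toy.zoom, xlab = "Time", ylab = "Value", bty = "l")

# Changing the scale: logarithmic vertical axis
plot(toy.ts, log = "y", xlab = "Time", ylab = "Value (log scale)", bty = "l")

# Adding trend lines: linear and quadratic fits on the time index
lin.fit <- lm(toy.values ~ t.index)
quad.fit <- lm(toy.values ~ t.index + I(t.index^2))
plot(toy.ts, xlab = "Time", ylab = "Value", bty = "l")
lines(ts(fitted(lin.fit), start = c(1995, 1), frequency = 12), lty = 2)
lines(ts(fitted(quad.fit), start = c(1995, 1), frequency = 12), lty = 3)

# Suppressing seasonality: aggregate the 12 months of each year into one total
toy.yearly <- aggregate(toy.ts, nfrequency = 1, FUN = sum)
plot(toy.yearly, xlab = "Year", ylab = "Yearly total", bty = "l")
```

The same calls should apply to ridership.ts, since it is also a monthly ts object; only the axis labels and the window dates would need to change.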
Some forecasting methods make assumptions about the structure of the time series components. The advantage of methods that rely on such assumptions is that when the assumptions are reasonably met, the resulting forecasts will be more robust and the models more understandable. In contrast, data-driven forecasting methods make fewer assumptions about the structure of these components and instead try to estimate them only from the data.
The goal of visualizing time series is to identify interesting features, such as trends, seasonality, outliers, missing values and irregular patterns. Identifying these features is extremely useful for two purposes:
- For choosing the appropriate forecasting methods: some methods are suitable only for series without seasonality. Some methods can’t deal with missing values, while others can.
- Once we fit a forecasting model, we will examine its performance by comparing the original series to the forecasted series. Doing that visually is very useful.
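Some of these features, such as missing values and unequal spacing, can also be checked programmatically before plotting. The following is a minimal sketch in base R, assuming a hypothetical data frame obs with a monthly date column and a value column; the numbers are invented for illustration.

```r
# Hypothetical monthly observations with one NA value and one skipped month
obs <- data.frame(
  month = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01",
                    "2020-05-01", "2020-06-01")),
  value = c(5.1, NA, 6.3, 7.0, 6.8)
)

# Missing values: rows where the measurement itself is NA
na.rows <- which(is.na(obs$value))

# Unequal spacing: months expected in a regular sequence but absent from the data
expected <- seq(min(obs$month), max(obs$month), by = "month")
missing.months <- expected[!(expected %in% obs$month)]

na.rows          # row 2 has an NA value
missing.months   # April 2020 is absent from the data
```

Distinguishing the two cases matters when choosing a method: an NA inside a regular series and a row that is absent altogether may need different handling.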
library(readxl)
Sept11Travel <-
readxl::read_excel("Sept11Travel.xls") %>%
mutate(MonthYear = tsibble::yearmonth(format(as.Date(Month,"%d/%m/%Y"))))
To assess the impact of September 11, the Bureau of Transportation Statistics (BTS) took the following approach: using data from before September 11, it forecasted future data (under the assumption of no terrorist attack). Then, BTS compared the forecasted series with the actual data to assess the impact of the event.
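The logic of this approach (fit on pre-event data, forecast forward, compare with what actually happened) can be sketched as follows. This is not the model BTS used; it is a simple regression with a linear trend and monthly dummies, applied to a synthetic series invented for illustration, with an assumed drop of 20 units starting at a hypothetical event month.

```r
# Synthetic monthly series, invented for illustration (not the BTS data):
# linear trend + yearly seasonality + noise, with a drop of 20 from month 49 on
set.seed(1)
n <- 60
event <- 49                                   # first affected month (assumption)
toy <- data.frame(t = 1:n, month = factor(rep(1:12, length.out = n)))
toy$y <- 100 + 0.5 * toy$t + 10 * sin(2 * pi * toy$t / 12) + rnorm(n, sd = 1)
toy$y[event:n] <- toy$y[event:n] - 20

# Fit trend and seasonality using only the pre-event months
pre <- toy[toy$t < event, ]
fit <- lm(y ~ t + month, data = pre)

# Forecast the post-event months as if the event had not happened
post <- toy[toy$t >= event, ]
post$forecast <- predict(fit, newdata = post)

# Estimated impact: actual values minus the no-event forecast
post$impact <- post$y - post$forecast
mean(post$impact)   # should be close to the simulated drop of -20
```

Plotting post$y and post$forecast on the same axes gives the visual version of the comparison: the gap between the two lines is the estimated effect of the event.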
For air travel, there is seasonality (peak in June-August) and a positive linear trend, although there was a dip in September 2001.
Sept11Travel.Air <- tsibble::as_tsibble(
Sept11Travel %>% select(MonthYear,`Air RPM (000s)`),
index = MonthYear
) %>% arrange(MonthYear)
feasts::autoplot(Sept11Travel.Air) + labs(title = "Time plot: Travel by Airplane", x = "Time")
Sept11Travel.Air %>%
feasts::gg_season(y = `Air RPM (000s)`, period = "year")+
labs(title = "Seasonal plots: Travel by Airplane", x = "Month")
# Create subseries plot
Sept11Travel.Air %>%
feasts::gg_subseries(`Air RPM (000s)`
, period = "year" # this can be omitted
)+
labs(title = "Subseries plot by Month", x = "Year")
For rail travel, there is seasonality (peak in July-August); the trend is initially steady, then declines, and then rises again.
Sept11Travel.Train <- tsibble::as_tsibble(
Sept11Travel %>% select(MonthYear,`Rail PM`),
index = MonthYear
) %>% arrange(MonthYear)
feasts::autoplot(Sept11Travel.Train)
Sept11Travel.Train %>%
feasts::gg_season(y = `Rail PM`, period = "year")+
labs(title = "Seasonal plots: Travel by Rail", x = "Month")
# Create subseries plot
Sept11Travel.Train %>%
feasts::gg_subseries(`Rail PM`
, period = "year" # this can be omitted
)+
labs(title = "Subseries plot by Month", x = "Year")
For car travel, there is seasonality (peak in July-August) and a positive trend.
Sept11Travel.Car <- tsibble::as_tsibble(
Sept11Travel %>% select(MonthYear,`VMT (billions)`),
index = MonthYear
) %>% arrange(MonthYear)
feasts::autoplot(Sept11Travel.Car)
Sept11Travel.Car %>%
feasts::gg_season(y = `VMT (billions)`, period = "year")+
labs(title = "Seasonal plots: Travel by Car", x = "Month")
# Create subseries plot
Sept11Travel.Car %>%
feasts::gg_subseries(`VMT (billions)`
, period = "year" # this can be omitted
)+
labs(title = "Subseries plot by Month", x = "Year")
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems make them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
You are provided with hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set covers the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
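The split rule described above can be illustrated with a small sketch. The timestamps below are generated for two hypothetical months and do not come from the Kaggle files, and the names toy, toy.train and toy.test are made up for this example.

```r
# Hypothetical hourly timestamps covering January and February 2011
stamps <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
              as.POSIXct("2011-02-28 23:00:00", tz = "UTC"),
              by = "hour")
toy <- data.frame(datetime = stamps)

# Day of the month as an integer (1 to 31)
toy$day.of.month <- as.integer(format(toy$datetime, "%d"))

# Days 1 to 19 go to training; day 20 onward goes to testing
toy.train <- toy[toy$day.of.month <= 19, ]
toy.test  <- toy[toy$day.of.month >= 20, ]

nrow(toy.train)  # 2 months x 19 days x 24 hours = 912
nrow(toy.test)   # (12 + 9) days x 24 hours = 504
```

Note that this split interleaves training and test periods month by month, which is why the competition rules restrict predictions to information available before each rental period.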
bike_sharing<-read.csv("Kaggle_bike_sharing_train.csv")
library(ggplot2)
library(lubridate)
library(readr)
library(scales)
library(plotly)
bike_sharing$hour <- hour(ymd_hms(bike_sharing$datetime))
bike_sharing$times <- as.POSIXct(strftime(ymd_hms(bike_sharing$datetime), format="%H:%M:%S"), format="%H:%M:%S")
bike_sharing$day <- wday(ymd_hms(bike_sharing$datetime), label=TRUE)
rent_plot<-ggplot(bike_sharing, aes(x=times, y=count, color=day)) +
geom_smooth(se=FALSE, fill=NA, size=2) +
theme_light(base_size=20) +
xlab("Hour of the Day") +
scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) +
ylab("Number of Bike Rentals") +
scale_color_discrete("") +
ggtitle("N bike rentals by day of the week and hour") +
theme(plot.title=element_text(size=18),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 12))
temp_plot<-ggplot(bike_sharing, aes(x=times, y=temp)) +
geom_smooth(se=FALSE, fill=NA, size=2) +
theme_light(base_size=20) +
xlab("Hour of the Day") +
scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) +
ylab("Temperature [Celsius]") +
#scale_color_discrete("") +
ggtitle("Temperature by hour of the day") +
theme(plot.title=element_text(size=18),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 12))
ggplotly(rent_plot)
ggplotly(temp_plot)