An effective initial step for characterizing the nature of a time series and for detecting potential problems is to use data visualization. By visualizing the series we can detect initial patterns, identify its components and spot potential problems such as extreme values, unequal spacing, and missing values.
The most basic and informative plot for visualizing a time series is the time plot. In its simplest form, a time plot is a line chart of the series values ($y_1, y_2, \ldots$) over time ($t = 1, 2, \ldots$), with temporal labels (e.g., calendar date) on the horizontal axis.
To illustrate this, consider the example of ridership on Amtrak trains in the USA. A time plot of the monthly Amtrak ridership series is shown below.
The data on Amtrak ridership are available at www.forecastingbook.com/mooc.
For more examples and further details on visualizing time series, see Chapter 3 in the textbook Practical Time Series Forecasting.

1 Import time series data

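library(dplyr)    # provides %>%, mutate(), select() and arrange()
library(ggplot2)  # provides labs(), used in the plots below

# Parse the Month column (day/month/year text) into a year-month index
Amtrak.data <- read.csv("Amtrak data.csv") %>%
  mutate(MonthYear = tsibble::yearmonth(format(as.Date(Month, "%d/%m/%Y"))))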
print(paste0("min date: ", min(Amtrak.data$MonthYear), " ||| max date:", max(Amtrak.data$MonthYear)))

## [1] "min date: 1991 Jan ||| max date:2004 Mar"

print(paste0("min ridership: ", min(Amtrak.data$Ridership), " ||| max ridership: ", max(Amtrak.data$Ridership)))

## [1] "min ridership: 1360.852 ||| max ridership: 2223.349"

2 Create time series object

ridership.ts <- ts(Amtrak.data$Ridership, start = c(1991,1), end = c(2004,3), frequency = 12)

3 Plot time series

plot(ridership.ts, xlab = "Time", ylab = "Ridership", ylim = c(1300,2300), bty = "l")

The function ts creates a time series object from the Ridership column of the data frame (Amtrak.data$Ridership). We name this time series ridership.ts. It starts in January 1991, ends in March 2004, and has a frequency of 12 observations (months) per year. Defining the frequency as 12 lets us later use other functions to examine the seasonal pattern. The plot call in step 3 produces the actual time plot. R gives us control over the labels, axis limits, and the plot’s border type.
Note that the values are in thousands of riders. Looking at the time plot reveals the nature of the series components: the overall level is around 1,800,000 passengers per month. A slight U-shaped trend is discernible during this period, with pronounced annual seasonality; peak travel occurs during the summer months of July and August.
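To make the roles of start and frequency in ts more concrete, here is a minimal sketch on a made-up monthly series. The values 1 to 27 are placeholders, not Amtrak data, and the object names are hypothetical. Only base R functions are used.

```r
# Placeholder values, NOT the Amtrak data: 27 consecutive monthly observations
toy.ts <- ts(1:27, start = c(1991, 1), frequency = 12)

start(toy.ts)      # first period (year, month)
end(toy.ts)        # last period (year, month)
frequency(toy.ts)  # number of observations per year

# cycle() gives each observation's position within the year (1 = January)
head(cycle(toy.ts), 3)

# window() extracts a sub-period, here the calendar year 1992
toy.1992 <- window(toy.ts, start = c(1992, 1), end = c(1992, 12))

# Averages by calendar month: a first numeric look at seasonality
monthly.means <- tapply(as.numeric(toy.ts), as.integer(cycle(toy.ts)), mean)
```

Because the frequency is stored in the object, functions such as cycle and window can work in calendar terms without a separate date column.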

ridership.tsibble <- tsibble::as_tsibble(Amtrak.data %>% select(MonthYear, Ridership),
                        index = MonthYear) %>%
  arrange(MonthYear)

feasts::autoplot(ridership.tsibble)

# Zoom in on the first year of the series (1991)
plot(ts(Amtrak.data$Ridership, start = c(1991,1), end = c(1991,12), frequency = 12),
     xlab = "Time", ylab = "Ridership", ylim = c(1300,2300), bty = "l")

ridership.tsibble %>%
  feasts::gg_season(y = Ridership, period = "year")+
   labs(title = "Seasonal plots", x = "Month")

A second step in visualizing a time series is to examine it more carefully. The following operations are useful:

Zooming in: Zooming in to a shorter period within the series can reveal patterns that are hidden when viewing the entire series. This is especially important when the time series is long.
Changing the Scale: To better identify the shape of a trend, it is useful to change the scale of the series. One simple option is to change the vertical scale (of y) to a logarithmic scale. If the trend on the new scale appears more linear, then the trend in the original series is closer to an exponential trend.
Adding Trend Lines: Another possibility for better capturing the shape of the trend is to add a trend line. By trying different trend lines one can see what type of trend (e.g., linear, exponential, cubic) best approximates the data.
Suppressing Seasonality: It is often easier to see trends in the data when seasonality is suppressed. Suppressing seasonal patterns can be done by plotting the series at a cruder time scale (e.g., aggregating monthly data into years). A second option is to plot separate time plots for each season. A third, popular option is to use moving average plots. We discuss moving average plots in Week 4.
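The four operations above can be sketched in base R. This is an illustration under assumptions, not the course's own code: it uses the built-in AirPassengers series as a stand-in for the Amtrak data, and the choice of a three-year window and of linear and quadratic trend lines is arbitrary.

```r
# Built-in stand-in series (not the Amtrak data): monthly airline passengers, 1949-1960
y <- AirPassengers

op <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels

# 1. Zooming in: restrict the plot to a three-year window
plot(window(y, start = c(1955, 1), end = c(1957, 12)),
     main = "Zoom: 1955-1957", ylab = "Passengers (000s)", bty = "l")

# 2. Changing the scale: logarithmic vertical axis
plot(y, log = "y", main = "Log scale", ylab = "Passengers (000s)", bty = "l")

# 3. Adding trend lines: linear and quadratic fits against time
t <- as.numeric(time(y))
y.num <- as.numeric(y)
linear.fit <- lm(y.num ~ t)
quad.fit <- lm(y.num ~ poly(t, 2))
plot(y, main = "Trend lines", ylab = "Passengers (000s)", bty = "l")
lines(t, fitted(linear.fit), lty = 2)
lines(t, fitted(quad.fit), lty = 3)

# 4. Suppressing seasonality: aggregate the monthly values into yearly totals
y.yearly <- aggregate(y, nfrequency = 1, FUN = sum)
plot(y.yearly, main = "Yearly totals", ylab = "Passengers (000s)", bty = "l")

par(op)  # restore the previous plotting layout
```

Placing the four panels side by side makes it easier to compare what each operation reveals about the same series.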

# Create subseries plot
ridership.tsibble  %>%
  feasts::gg_subseries(Ridership
                       , period = "year" # this can be omitted
                       )+
   labs(title = "Subseries plot by Month", x = "Year")

4 Why should we try and identify the series components?

Some forecasting methods directly model these components by making assumptions about their structure.
- For example, a popular assumption about a trend is that it is linear or exponential over some, or all, of the given time period.
- Another common assumption is about the noise structure: many statistical methods assume that the noise follows a normal distribution.
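One way to examine such assumptions is sketched below. It is an illustration only: the built-in AirPassengers series stands in for real data, the regression with monthly dummies is an assumed model, and the Shapiro-Wilk test is just one of several possible normality checks.

```r
# Built-in stand-in series (not the Amtrak data)
y <- as.numeric(AirPassengers)
t <- seq_along(y)
month <- factor(as.integer(cycle(AirPassengers)))

# Assumption A: linear trend, y = a + b*t + month effect + noise
linear.fit <- lm(y ~ t + month)
# Assumption B: exponential trend, log(y) = a + b*t + month effect + noise
exp.fit <- lm(log(y) ~ t + month)

# Compare both fits on the original scale (root mean squared error)
rmse.linear <- sqrt(mean((y - fitted(linear.fit))^2))
rmse.exp <- sqrt(mean((y - exp(fitted(exp.fit)))^2))
c(linear = rmse.linear, exponential = rmse.exp)

# Noise assumption: inspect the residuals for departures from normality
noise <- residuals(exp.fit)
qqnorm(noise); qqline(noise)
normality.test <- shapiro.test(noise)
```

A lower error for one trend form, or a Q-Q plot that bends away from the reference line, indicates which assumptions are more or less reasonable for the series at hand.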

The advantage of methods that rely on such assumptions is that when the assumptions are reasonably met, the resulting forecasts will be more robust and the models more understandable. In contrast, data-driven forecasting methods make fewer assumptions about the structure of these components and instead try to estimate them only from the data.

Time plots are also useful for characterizing the global or local nature of the patterns.
- A global pattern is one that is relatively constant throughout the series. An example is a linear trend throughout the entire series.
- In contrast, a local pattern is one that occurs only in a short period of the data, and then changes. An example is a trend that is approximately linear within four neighboring time points, but the trend size (slope) changes slowly over time. Operations such as zooming in can help establish more subtle changes in seasonal patterns or trends across periods. Breaking down the series into multiple sub-series and overlaying them in a single plot can also help establish whether a pattern (such as weekly seasonality) changes from season to season.
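The difference between a global and a local trend can be made concrete with a small simulation. The series below is synthetic (generated with a fixed seed, for illustration only) and all names are hypothetical. The window of four neighboring points follows the example in the text.

```r
set.seed(1)
n <- 60
t <- 1:n
# Synthetic series (illustration only): the slope drifts upward over time
y <- cumsum(seq(0.5, 2, length.out = n)) + rnorm(n, sd = 0.5)

# Global pattern: a single slope for the whole series
global.slope <- coef(lm(y ~ t))[["t"]]

# Local pattern: a separate slope for each window of four neighboring points
width <- 4
starts <- 1:(n - width + 1)
local.slopes <- sapply(starts, function(s) {
  idx <- s:(s + width - 1)
  coef(lm(y[idx] ~ t[idx]))[[2]]
})

plot(starts, local.slopes, type = "l",
     xlab = "Window start", ylab = "Local slope", bty = "l")
abline(h = global.slope, lty = 2)  # global slope for reference
```

If the local slopes wander far from the dashed global slope, a single global trend is a poor description of the series.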

5 Interactive Visualization

The various operations described above (zooming in, changing scales, adding trend lines, aggregating temporally, and breaking down the series into multiple time plots) are all possible using software such as Excel and R. However, each operation requires generating a new plot, or at least going through several steps until the modified chart is achieved. The time lag between manipulating the chart and viewing the results detracts from our ability to compare and “connect the dots” between the different visualizations. Interactive visualization software offers the same functionality (and usually much more), with the added benefit of very quick and easy chart manipulation. An additional powerful feature of interactive software is the ability to link multiple plots of the same data. Operations in one plot (such as zooming in) are then automatically applied to all the linked plots. A set of such linked charts is often called a dashboard.
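A minimal sketch of two linked plots, assuming the plotly package is installed. The built-in AirPassengers series is used as a stand-in, and the object names are hypothetical.

```r
library(plotly)

# Built-in stand-in series (not the Amtrak data)
df <- data.frame(
  time = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)
df$log.value <- log(df$value)

# Two views of the same data: original scale and log scale
p.raw <- plot_ly(df, x = ~time, y = ~value,
                 type = "scatter", mode = "lines", name = "Original scale")
p.log <- plot_ly(df, x = ~time, y = ~log.value,
                 type = "scatter", mode = "lines", name = "Log scale")

# shareX = TRUE links the horizontal axes: zooming in one panel zooms the other
dashboard <- subplot(p.raw, p.log, nrows = 2, shareX = TRUE)
dashboard
```

Dragging a zoom box in either panel rescales the time axis of both, which is the linking behavior described above.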

The goal of visualizing time series is to identify interesting features such as trends, seasonality, outliers, missing values, and irregular patterns. Identifying these features is extremely useful for two purposes:
- Choosing appropriate forecasting methods: some methods are suitable only for series without seasonality, and some methods can’t deal with missing values, while others can.
- Evaluating a fitted forecasting model: we examine its performance by comparing the original series to the forecasted series, and doing that visually is very useful.
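Missing values and unequal spacing can also be checked numerically before choosing a method. The sketch below uses a small synthetic monthly data frame (illustration only) in which one value is NA and one month is absent altogether.

```r
# Synthetic monthly data (illustration only)
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = 12)
values <- c(10, 12, 11, NA, 15, 14, 16, 18, 17, 19, 21, 20)
series <- data.frame(date = dates, value = values)[-7, ]  # drop July to create a gap

# Missing values: rows whose measurement is NA
n.missing <- sum(is.na(series$value))

# Unequal spacing: compare the observed dates with the full monthly calendar
full.calendar <- seq(min(series$date), max(series$date), by = "month")
absent.months <- full.calendar[!full.calendar %in% series$date]

n.missing                       # count of NA measurements
format(absent.months, "%Y-%m")  # months with no row at all
```

The two checks are distinct: an NA is a recorded period with no measurement, while an absent month makes the series unequally spaced.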

5.1 Question 1

The Research and Innovative Technology Administration’s Bureau of Transportation Statistics (BTS) conducted a study to evaluate the impact of the September 11, 2001, terrorist attack on U.S. transportation. The goal of the study was stated as follows: “The purpose of this study is to provide a greater understanding of the passenger travel behavior patterns of persons making long distance trips before and after September 11.” The report analyzes monthly passenger movement data between January 1990 and April 2004. Data on three monthly time series for this period are given in the file Sept11Travel.xls:
- actual airline revenue passenger miles (Air)
- rail passenger miles (Rail)
- vehicle miles traveled (Auto)

library(readxl)
Sept11Travel <- 
   readxl::read_excel("Sept11Travel.xls") %>% 
   mutate(MonthYear = tsibble::yearmonth(format(as.Date(Month,"%d/%m/%Y"))))

To assess the impact of September 11, BTS took the following approach: Using data before September 11, it forecasted future data (under the assumption of no terrorist attack). Then, BTS compared the forecasted series with the actual data to assess the impact of the event.
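The logic of this approach can be sketched on a different, built-in series. The example is an illustration under stated assumptions and not the BTS method. It uses R’s UKDriverDeaths data (monthly, January 1969 to December 1984) and treats the seat belt law that took effect at the end of January 1983 as the event. The forecasting model is an assumed simple regression with a linear trend and monthly dummies.

```r
# Stand-in data (not Sept11Travel): monthly UK driver deaths, Jan 1969 to Dec 1984
pre  <- window(UKDriverDeaths, end = c(1983, 1))    # before the event
post <- window(UKDriverDeaths, start = c(1983, 2))  # after the event

# Fit a linear trend plus monthly seasonality on pre-event data only
train <- data.frame(
  y = as.numeric(pre),
  t = as.numeric(time(pre)),
  month = factor(as.integer(cycle(pre)), levels = 1:12)
)
fit <- lm(y ~ t + month, data = train)

# Forecast the post-event months as if the event had not happened
future <- data.frame(
  t = as.numeric(time(post)),
  month = factor(as.integer(cycle(post)), levels = 1:12)
)
forecasted <- ts(predict(fit, newdata = future),
                 start = start(post), frequency = 12)

# Actual minus forecast: the estimated effect of the event in each month
gap <- post - forecasted

ts.plot(post, forecasted, lty = c(1, 2), ylab = "Driver deaths")
legend("topright", legend = c("Actual", "Forecast without event"),
       lty = c(1, 2), bty = "n")
```

Plotting the actual and forecasted series together shows the size and duration of the deviation. How much of it can be attributed to the event depends on how well the model describes the pre-event data.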

5.1.1 Travel by airplane

There is seasonality (peak in June-August) and a positive linear trend, although there was a dip in September 2001.

Sept11Travel.Air <- tsibble::as_tsibble(
   Sept11Travel %>% select(MonthYear,`Air RPM (000s)`),
   index = MonthYear
   ) %>% arrange(MonthYear)

feasts::autoplot(Sept11Travel.Air) + labs(title = "Time plot: Travel by Airplane", x = "Time")

Sept11Travel.Air %>%
  feasts::gg_season(y = `Air RPM (000s)`, period = "year")+
   labs(title = "Seasonal plots: Travel by Airplane", x = "Month")

# Create subseries plot
Sept11Travel.Air  %>%
  feasts::gg_subseries(`Air RPM (000s)`
                       , period = "year" # this can be omitted
                       )+
   labs(title = "Subseries plot by Month", x = "Year")

5.1.2 Travel by Rail

There is seasonality (peak in July-August). The trend is initially steady, then declines, and then rises again.

Sept11Travel.Train <- tsibble::as_tsibble(
   Sept11Travel %>% select(MonthYear,`Rail PM`),
   index = MonthYear
   ) %>% arrange(MonthYear)

feasts::autoplot(Sept11Travel.Train)

Sept11Travel.Train %>%
  feasts::gg_season(y = `Rail PM`, period = "year")+
   labs(title = "Seasonal plots: Travel by Rail", x = "Month")

# Create subseries plot
Sept11Travel.Train  %>%
  feasts::gg_subseries(`Rail PM`
                       , period = "year" # this can be omitted
                       )+
   labs(title = "Subseries plot by Month", x = "Year")

5.1.3 Travel by Car

There is seasonality (peak in July-August) and a positive trend.

Sept11Travel.Car <- tsibble::as_tsibble(
   Sept11Travel %>% select(MonthYear,`VMT (billions)`),
   index = MonthYear
   ) %>% arrange(MonthYear)

feasts::autoplot(Sept11Travel.Car)

Sept11Travel.Car %>%
  feasts::gg_season(y = `VMT (billions)`, period = "year")+
   labs(title = "Seasonal plots: Travel by Car", x = "Month")

# Create subseries plot
Sept11Travel.Car  %>%
  feasts::gg_subseries(`VMT (billions)`
                       , period = "year" # this can be omitted
                       )+
   labs(title = "Subseries plot by Month", x = "Year")

5.1.4 Kaggle challenge

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
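The split rule described above (days 1 to 19 for training, day 20 onward for testing) can be sketched as follows. The timestamps are synthetic and cover only two months, for illustration. The column name datetime mirrors the one used in the code below, and the other names are hypothetical.

```r
# Synthetic hourly timestamps (illustration only): January and February 2011
datetime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
                as.POSIXct("2011-02-28 23:00:00", tz = "UTC"),
                by = "hour")
hourly <- data.frame(datetime = datetime)

# Day of the month (1-31) for every hourly record
day.of.month <- as.integer(format(hourly$datetime, "%d"))

# Training set: first 19 days of each month; test set: day 20 to month end
train <- hourly[day.of.month <= 19, , drop = FALSE]
test  <- hourly[day.of.month >= 20, , drop = FALSE]

c(train = nrow(train), test = nrow(test))
```

Note that with this design each test block sits between two training blocks. This is why the competition rules restrict predictions to information available prior to the rental period.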

bike_sharing<-read.csv("Kaggle_bike_sharing_train.csv")

library(ggplot2)
library(lubridate)
library(readr)
library(scales)
library(plotly)

bike_sharing$hour  <- hour(ymd_hms(bike_sharing$datetime))
bike_sharing$times <- as.POSIXct(strftime(ymd_hms(bike_sharing$datetime), format="%H:%M:%S"), format="%H:%M:%S")
bike_sharing$day   <- wday(ymd_hms(bike_sharing$datetime), label=TRUE)

# Smoothed rentals by time of day, one curve per day of the week
rent_plot<-ggplot(bike_sharing, aes(x=times, y=count, color=day)) +
     geom_smooth(se=FALSE, fill=NA, size=2) +  # se=FALSE hides the confidence band
     theme_light(base_size=20) +
     xlab("Hour of the Day") +
     scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) + 
     ylab("Number of Bike Rentals") +
     scale_color_discrete("") +
     ggtitle("Bike rentals by day of the week and hour") +
     theme(plot.title=element_text(size=18),
           axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 12))

# Smoothed temperature by time of day (all days pooled)
temp_plot<-ggplot(bike_sharing, aes(x=times, y=temp)) +
     geom_smooth(se=FALSE, fill=NA, size=2) +  # se=FALSE hides the confidence band
     theme_light(base_size=20) +
     xlab("Hour of the Day") +
     scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) + 
     ylab("Temperature [Celsius]") +
     ggtitle("Temperature by hour of the day") +
     theme(plot.title=element_text(size=18),
           axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 12))

ggplotly(rent_plot)

ggplotly(temp_plot)

FutureLearn - Forecasting Week 2

Aura Frizzati

01 May, 2023