In this project, I developed a regression model that uses multiple time series variables and then critiqued the resulting fit.

Preparation and About the Data

The following code was executed to load the data needed:

source("code_Time_Series.R")
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
load("data_ice.Rdata")

This data set consists of five time series objects. It contains monthly bus ridership numbers for Iowa City Transit for the 11-year period from January 1978 to December 1988, along with several variables that pertain to bus ridership. The time series are rides, students, spaces, rp_fare, and rp_gas.


Data Exploration

Let’s start by examining the raw data and some of its basic characteristics.

I plotted both rides and log(rides) to see which has the more stable variation over time.

autoplot(rides)

autoplot(log(rides))

The variation of the logged data is more stable, so I will model log(rides) rather than the original series.
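The reason a log transform helps can be sketched on simulated data. The series below is not the Iowa City data: it is a hypothetical monthly series whose seasonal swings grow with its level, which is the pattern a log transform is meant to stabilize. All names and parameter values here are illustrative assumptions.

```r
# Simulated monthly series, NOT the ridership data: a rising level multiplied
# by a fixed seasonal factor, so the swings scale with the level.
months <- 1:132
level <- seq(100, 300, length.out = 132)          # hypothetical rising level
seasonal <- 1 + 0.3 * sin(2 * pi * months / 12)   # hypothetical seasonal factor
y <- level * seasonal

# Within-year standard deviation on the raw scale and on the log scale.
year <- rep(1978:1988, each = 12)
sd_raw <- tapply(y, year, sd)
sd_log <- tapply(log(y), year, sd)

# Ratio of the largest to the smallest yearly spread on each scale.
# A ratio near 1 means the variation is roughly constant across years.
spread_raw <- max(sd_raw) / min(sd_raw)
spread_log <- max(sd_log) / min(sd_log)
print(c(raw = spread_raw, log = spread_log))
```

On the raw scale the yearly spread grows with the level, while on the log scale it stays roughly constant. That is the same visual comparison made between the two plots above.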

Understanding All Variables

I ran the following code to get a feel for the other variables.

autoplot(students)

autoplot(spaces)

autoplot(rp_fare)

autoplot(rp_gas)

Regression Model

I used tslm to regress log(rides) on trend and season, then used aa_plot_fitted(fit) to assess the fit visually.

fit <- tslm(log(rides) ~ trend + season)
aa_plot_fitted(fit)

The fit doesn’t look very good. For example, it doesn’t capture the overall U-shape of the time series.
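Writing the model out makes the limitation easier to see. The notation below is my own, not taken from the project code: t = 1, ..., 132 indexes the months, and d_{j,t} equals 1 when observation t falls in season j and 0 otherwise, with season 1 as the baseline.

```latex
\log(\text{rides}_t) = \beta_0 + \beta_1 t + \sum_{j=2}^{12} \gamma_j \, d_{j,t} + \varepsilon_t
```

The seasonal terms repeat identically every year, so the only term that changes from one year to the next is the linear trend. On the log scale the fitted values can therefore only drift up or down at a constant rate, which is why this specification cannot reproduce a U-shape.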

Expanding the Regression Model with Multiple Series

Next, I expanded the prior regression to include the variables students, spaces, log(rp_fare), and log(rp_gas), and then assessed the fit against a number of time-series diagnostics.

fit <- tslm(log(rides) ~ trend + season + students + spaces + log(rp_fare) +
    log(rp_gas))
aa_plot_fitted(fit)

summary(fit)
## 
## Call:
## tslm(formula = log(rides) ~ trend + season + students + spaces + 
##     log(rp_fare) + log(rp_gas))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14811 -0.04624 -0.01090  0.03660  0.17276 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.2847901  0.2159824  15.209  < 2e-16 ***
## trend        -0.0045344  0.0004931  -9.196 1.94e-15 ***
## season2       0.1171999  0.0288228   4.066 8.78e-05 ***
## season3       0.0262450  0.0288455   0.910  0.36481    
## season4      -0.0384786  0.0288736  -1.333  0.18528    
## season5      -0.3125263  0.0289249 -10.805  < 2e-16 ***
## season6      -0.4006730  0.0289462 -13.842  < 2e-16 ***
## season7      -0.4097179  0.0290188 -14.119  < 2e-16 ***
## season8      -0.4343003  0.0290403 -14.955  < 2e-16 ***
## season9      -0.1369896  0.0288935  -4.741 6.14e-06 ***
## season10     -0.0022152  0.0288779  -0.077  0.93899    
## season11     -0.0569882  0.0289031  -1.972  0.05105 .  
## season12     -0.0811247  0.0289399  -2.803  0.00594 ** 
## students      0.0813214  0.0059164  13.745  < 2e-16 ***
## spaces       -0.0121496  0.0401286  -0.303  0.76261    
## log(rp_fare) -0.1556497  0.0952207  -1.635  0.10486    
## log(rp_gas)   0.4987317  0.0395896  12.598  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06758 on 115 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9322 
## F-statistic: 113.5 on 16 and 115 DF,  p-value: < 2.2e-16
accuracy(fit)
##              ME       RMSE        MAE         MPE      MAPE      MASE      ACF1
## Training set  0 0.06307404 0.05013463 -0.01579507 0.9962192 0.4901869 0.5346375
checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 69.805, df = 24, p-value = 2.34e-06
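Because the response is log(rides), the coefficients in the summary output are easier to read after converting them to percent changes. The sketch below does that arithmetic for a few estimates copied from the output above. It assumes season1 is January (the series starts in January 1978), so season8 is August. The helper pct_change is a hypothetical name introduced here for illustration.

```r
# Estimates copied from the summary(fit) output above.
b_gas   <- 0.4987317    # log(rp_gas): log-log term, read as an elasticity
b_trend <- -0.0045344   # change in log(rides) per month
b_aug   <- -0.4343003   # season8 relative to season1 (assumed August vs January)

# Percent change in rides implied by a shift of b on the log scale.
pct_change <- function(b) 100 * (exp(b) - 1)

pct_trend_month <- pct_change(b_trend)        # per month, other terms held fixed
pct_trend_year  <- pct_change(12 * b_trend)   # compounded over 12 months
pct_august      <- pct_change(b_aug)          # August relative to January

# A 10% rise in rp_gas multiplies rides by 1.10^b_gas, other terms held fixed.
pct_gas_10 <- 100 * (1.10^b_gas - 1)

print(round(c(trend_month = pct_trend_month, trend_year = pct_trend_year,
              august = pct_august, gas_up_10pct = pct_gas_10), 2))
```

Read this way, the model associates August with roughly a third fewer rides than January, and a 10% increase in rp_gas with roughly 5% more rides, holding the other regressors fixed. These are associations from the fitted model, not causal estimates.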

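This fit looks very good (adjusted R-squared of 0.9322), with two caveats. First, spaces and log(rp_fare) are not significant at the 5% level. Second, the residuals are autocorrelated: the lag-1 residual autocorrelation (ACF1) is 0.53, and the Breusch-Godfrey test rejects the hypothesis of no serial correlation (p-value = 2.34e-06), so the reported standard errors and p-values should be read with caution. We could reduce the autocorrelation by adding a lag of rides to the model.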
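The effect of adding a lagged response can be sketched with base R on simulated data. This is not the ridership model: the series, its coefficients, and the names fit_static, fit_dynamic, and acf1 are all assumptions made for illustration. The example fits a trend-only regression to a series with AR(1) errors, then adds the response lagged by one period and compares the lag-1 residual autocorrelation.

```r
set.seed(1)
n <- 132
trend <- 1:n

# Simulated data, NOT the ridership series: linear trend plus AR(1) errors.
err <- as.numeric(arima.sim(model = list(ar = 0.7), n = n, sd = 0.05))
y <- 3 - 0.004 * trend + err

# Static regression: the AR(1) structure is left in the residuals.
fit_static <- lm(y ~ trend)

# Dynamic regression: add y lagged by one period (the first observation is lost).
d <- data.frame(y = y[-1], trend = trend[-1], y_lag1 = y[-n])
fit_dynamic <- lm(y ~ trend + y_lag1, data = d)

# Lag-1 autocorrelation of the residuals (element 1 of $acf is lag 0).
acf1 <- function(fit) acf(residuals(fit), plot = FALSE)$acf[2]
acf1_static <- acf1(fit_static)
acf1_dynamic <- acf1(fit_dynamic)
print(c(static = acf1_static, dynamic = acf1_dynamic))
```

For the real model, the analogous step would be to add a one-month lag of log(rides) as a regressor and rerun checkresiduals(fit) to see whether the Breusch-Godfrey test still rejects.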