In this project, I developed a regression model with multiple time series predictors and then critiqued the model.
The following code was executed to load the data needed:
source("code_Time_Series.R")
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
load("data_ice.Rdata")
This data set consists of five time series objects. It contains bus ridership numbers for Iowa City Transit for the 11-year period from January 1978 to December 1988, as well as several variables that pertain to bus ridership. The time series are:
rides = number of bus riders during given month (in 1,000’s)
students = number of students enrolled at the University of Iowa in Iowa City during the corresponding fall semester (in 1,000’s)
spaces = number of downtown parking spaces in Iowa City during the given month (in 1,000’s)
rp_fare = bus fare (in Jan 1978 dollars) for a single ride; prices have been deflated to Jan 1978; the prefix “rp” stands for “real price”
rp_gas = real price of a gallon of gas (in Jan 1978 dollars)
Let’s start by examining the raw data and some of its basic characteristics.
I plotted both rides and log(rides) to see which has the more stable variation.
autoplot(rides)
autoplot(log(rides))
The variation of the logged data is more stable, so I plan to work with it over the original data.
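One rough way to put a number on "more stable variation" is to compare the within-year standard deviations before and after taking logs. The sketch below uses R's built-in AirPassengers series as a stand-in, since the bus data is not reproduced here. The helper names yearly_sd and spread_ratio are made up for illustration, and applying them to rides assumes that it is a monthly ts object starting in January.

```r
# Stand-in data: built-in monthly series (NOT the bus ridership data).
y <- AirPassengers

# Standard deviation within each consecutive block of 12 observations
# (calendar years here, because the series starts in January).
yearly_sd <- function(x) {
  block <- (seq_along(x) - 1) %/% frequency(x)
  tapply(as.numeric(x), block, sd)
}

# Ratio of the largest to the smallest yearly spread: closer to 1 = more stable.
spread_ratio <- function(x) {
  s <- yearly_sd(x)
  max(s) / min(s)
}

ratio_raw <- spread_ratio(y)
ratio_log <- spread_ratio(log(y))
print(c(raw = ratio_raw, log = ratio_log))
```

A smaller ratio for the logged series would support the visual impression from the two plots. This is only a summary statistic and does not replace looking at the plots themselves.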
I ran the following chunks of code to get a feel for the other variables.
autoplot(students)
autoplot(spaces)
autoplot(rp_fare)
autoplot(rp_gas)
I used tslm to regress log(rides) on trend and season, then used aa_plot_fitted(fit) to assess the fit visually.
fit <- tslm(log(rides) ~ trend + season)
aa_plot_fitted(fit)
The fit doesn’t look very good. For example, it doesn’t capture the overall U-shape of the time series.
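The reason a trend + season regression cannot follow a U-shape can be seen by writing the model out by hand. To my understanding, tslm builds trend as the time index 1, 2, ..., n and season as a factor for the month, so the fitted values are a straight line plus a fixed monthly offset. The sketch below reproduces that structure with base lm on the built-in AirPassengers series (a stand-in, not the bus data) and checks that the fitted values change by the same amount every year.

```r
# Stand-in data: built-in monthly series (NOT the bus ridership data).
y <- log(AirPassengers)

# Hand-built versions of the two regressors (assumed to mirror tslm's).
trend  <- seq_along(y)                    # 1, 2, ..., n
season <- factor(as.integer(cycle(y)))    # month of year, 1..12

fit_lm <- lm(as.numeric(y) ~ trend + season)

# Year-over-year change in the fitted values: same month, 12 steps apart,
# so the seasonal offsets cancel and only 12 * slope remains.
yoy_change <- unname(diff(fitted(fit_lm), lag = 12))
slope      <- unname(coef(fit_lm)["trend"])

print(range(yoy_change))
print(12 * slope)
```

Because the year-over-year change is constant, the fitted curve can only go steadily up or steadily down. A series that falls and then rises needs something extra, such as additional predictors.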
Next, I expanded the prior regression to include the variables students, spaces, log(rp_fare), and log(rp_gas). I then assess the fit against a number of time-series diagnostics.
fit <- tslm(log(rides) ~ trend + season + students + spaces + log(rp_fare) +
    log(rp_gas))
aa_plot_fitted(fit)
summary(fit)
##
## Call:
## tslm(formula = log(rides) ~ trend + season + students + spaces +
## log(rp_fare) + log(rp_gas))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.14811 -0.04624 -0.01090  0.03660  0.17276
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.2847901  0.2159824  15.209  < 2e-16 ***
## trend        -0.0045344  0.0004931  -9.196 1.94e-15 ***
## season2       0.1171999  0.0288228   4.066 8.78e-05 ***
## season3       0.0262450  0.0288455   0.910  0.36481
## season4      -0.0384786  0.0288736  -1.333  0.18528
## season5      -0.3125263  0.0289249 -10.805  < 2e-16 ***
## season6      -0.4006730  0.0289462 -13.842  < 2e-16 ***
## season7      -0.4097179  0.0290188 -14.119  < 2e-16 ***
## season8      -0.4343003  0.0290403 -14.955  < 2e-16 ***
## season9      -0.1369896  0.0288935  -4.741 6.14e-06 ***
## season10     -0.0022152  0.0288779  -0.077  0.93899
## season11     -0.0569882  0.0289031  -1.972  0.05105 .
## season12     -0.0811247  0.0289399  -2.803  0.00594 **
## students      0.0813214  0.0059164  13.745  < 2e-16 ***
## spaces       -0.0121496  0.0401286  -0.303  0.76261
## log(rp_fare) -0.1556497  0.0952207  -1.635  0.10486
## log(rp_gas)   0.4987317  0.0395896  12.598  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06758 on 115 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9322
## F-statistic: 113.5 on 16 and 115 DF, p-value: < 2.2e-16
accuracy(fit)
##              ME       RMSE        MAE         MPE      MAPE      MASE      ACF1
## Training set  0 0.06307404 0.05013463 -0.01579507 0.9962192 0.4901869 0.5346375
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 69.805, df = 24, p-value = 2.34e-06
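This fit looks very good (adjusted R-squared of 0.9322), apart from a few insignificant variables and some residual autocorrelation. Neither spaces nor log(rp_fare) is significant at the 5% level, and the same is true of several seasonal dummies. The Breusch-Godfrey test rejects the hypothesis of no serial correlation (p-value = 2.34e-06), and the lag-1 autocorrelation of the residuals (ACF1) is about 0.53. We could reduce the autocorrelation by bringing a lag of rides into the model.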
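A minimal sketch of the lagged-response idea follows. It uses base lm on the built-in AirPassengers series as a stand-in for log(rides), so the numbers say nothing about the bus data. The names y_lag1, fit_static, fit_lagged, and acf1 are hypothetical, and the extra predictors (students, spaces, and the two prices) are left out to keep the example self-contained.

```r
# Stand-in data: built-in monthly series (NOT the bus ridership data).
y <- as.numeric(log(AirPassengers))
n <- length(y)

d <- data.frame(
  y      = y,
  trend  = seq_len(n),
  season = factor(as.integer(cycle(AirPassengers))),
  y_lag1 = c(NA, y[-n])      # previous month's value; first month has none
)

# Model without and with the lagged response (lm drops the NA row).
fit_static <- lm(y ~ trend + season, data = d)
fit_lagged <- lm(y ~ trend + season + y_lag1, data = d)

# Lag-1 autocorrelation of the residuals, comparable to ACF1 above.
acf1 <- function(fit) acf(residuals(fit), plot = FALSE)$acf[2]

print(c(static = acf1(fit_static), lagged = acf1(fit_lagged)))
```

Adding the lag costs one observation at the start of the sample and changes how the other coefficients are read: they now describe the effect on rides given last month's level. With the actual data, the same comparison could be made by rerunning the Breusch-Godfrey test on the new fit.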