In any business environment, time series data is of core importance: it is what a company uses to describe its actual performance in the past and what it can use to predict the future.
In this vignette I provide examples of techniques and functions in R that can be used to analyse time series data and make predictions based on known actuals.
Data
The dataset used to create this vignette contains daily statistics on bike hiring for 2011-2012 and was taken from the open UCI Machine Learning Repository [1]. The same dataset was used in a very good tutorial on ARIMA modelling in R; ARIMA is not the focus of this vignette, but the method is interesting and you can go through that tutorial here: https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials [2]. To provide some examples I also used datasets built into base R (e.g. austres, AirPassengers, LakeHuron).
Beginning: Visualise and Explore
The first step (after loading the data) is to visualise it on a plot - this gives insights into the patterns and trends in the data that are important to take into account in later modelling.
setwd("C:/Users/SS/Desktop/UTS/36103_Statistical_Thinking_for_Data_Science/R_wd")
bike <- read.csv("C:/Users/SS/Desktop/UTS/36103_Statistical_Thinking_for_Data_Science/R_wd/day.csv", header=TRUE, stringsAsFactors=FALSE)
head(bike)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
Plot bike hiring count as a line plot.
bike$date <- as.Date(bike$dteday)
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(ggplot2)
ggplot(bike, aes(date, cnt)) + geom_line() + ylab("Daily Bike Hiring Count") +
xlab("date")
The data looks quite volatile, with distinct dips/outliers; it seemed to be on a growing trend until October 2012, but then it started to decline. We may look at box plots for a better understanding, as it is often easier to see seasonality patterns or cycles in a box plot.
ggplot(bike, aes(y=cnt, x = weekday)) + geom_boxplot(aes(group=weekday)) + scale_x_continuous(breaks=seq(0,7,1)) + ggtitle("Bike Hiring Count by Day")
ggplot(bike, aes(y=cnt, x = mnth)) + geom_boxplot(aes(group=mnth)) + scale_x_continuous(breaks=seq(0,12,1), labels=c("","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) + ggtitle("Bike Hiring Count by Month")
One of the essential criteria for time series data to be usable in a regression model is stationarity, which implies that neither the mean, nor the variance, nor the autocorrelation is a function of time (e.g. [3]). When we look at the graphs we can often see whether there is a trend or seasonal cycling in the data, but running a Dickey-Fuller test is a more reliable way to answer the question about stationarity. There are at least two functions known in R that we may use to run the Dickey-Fuller test: adf.test from the tseries package and adfTest from fUnitRoots. Let's try them both and compare results.
install.packages("tseries", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'tseries' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(tseries)
adf.test(bike$cnt, alternative = "stationary", k=0)
## Warning in adf.test(bike$cnt, alternative = "stationary", k = 0): p-value
## smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: bike$cnt
## Dickey-Fuller = -10.052, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
install.packages("fUnitRoots", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'fUnitRoots' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(fUnitRoots)
## Loading required package: timeDate
## Loading required package: timeSeries
## Loading required package: fBasics
adfTest(bike$cnt, lags = 0, type = "c") #this function requires a type argument - "c", "ct", or "nc"; here we use the default "c", meaning a constant (intercept) is included but no time trend
## Warning in adfTest(bike$cnt, lags = 0, type = "c"): p-value smaller than
## printed p-value
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -7.8083
## P VALUE:
## 0.01
##
## Description:
## Thu Mar 29 21:48:21 2018 by user: SS
Both tests returned a p-value below 0.01, so we can reject the null hypothesis of a unit root and conclude that the daily series is stationary.
This could be expected, as daily data normally contains a lot of noise; we might get different results if we run the tests on longer periods (e.g. weeks or months) or on moving averages rather than on the daily actuals.
bike$ConsPer <- ifelse(bike$yr == 1, 12,bike$yr) #yr is coded 0 for 2011 and 1 for 2012, so this gives an offset of 0 or 12
bike$ConsPer <- bike$ConsPer + bike$mnth #consecutive month index, 1-24 across the two years
AggMnth <- aggregate (cnt ~ ConsPer, bike, FUN = sum) #total hiring count per month
adf.test(AggMnth$cnt, alternative = "stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: AggMnth$cnt
## Dickey-Fuller = -0.68964, Lag order = 0, p-value = 0.9592
## alternative hypothesis: stationary
adfTest(AggMnth$cnt, lags = 0, type = "c")
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -2.0078
## P VALUE:
## 0.3202
##
## Description:
## Thu Mar 29 21:48:22 2018 by user: SS
In this case the results suggest that the null hypothesis cannot be rejected with confidence, so the monthly data may not be stationary.
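To see more concretely what the test detects, here is a quick illustration with simulated data (a sketch; it assumes tseries is already loaded): white noise is stationary by construction, while a random walk, whose mean wanders over time, is not.
set.seed(1)
white_noise <- rnorm(200)          #stationary by construction
random_walk <- cumsum(rnorm(200))  #contains a unit root; the mean drifts over time
adf.test(white_noise, alternative = "stationary", k = 0)  #expect a small p-value (reject unit root)
adf.test(random_walk, alternative = "stationary", k = 0)  #expect a large p-value (cannot reject unit root)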
Note that in some cases the functions adf.test and adfTest may produce different results that can even seem contradictory, e.g.:
plot(AirPassengers) #this series demonstrates a distinct growing trend and seasonality
adf.test(AirPassengers, alternative="stationary", k=0) #however adf.test from tseries suggests the data is stationary
## Warning in adf.test(AirPassengers, alternative = "stationary", k = 0): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: AirPassengers
## Dickey-Fuller = -4.6392, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
adfTest(AirPassengers, lags = 0, type = "c") #but this function suggests the opposite
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -1.7481
## P VALUE:
## 0.4075
##
## Description:
## Thu Mar 29 21:48:22 2018 by user: SS
#the same happens here, with the line plot of quarterly earnings per Johnson & Johnson share
plot(JohnsonJohnson) #a distinct growing trend, possibly with some seasonality
adf.test(JohnsonJohnson, alternative="stationary", k=0) #yet adf.test again indicates stationarity
## Warning in adf.test(JohnsonJohnson, alternative = "stationary", k = 0): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: JohnsonJohnson
## Dickey-Fuller = -4.4474, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
adfTest(JohnsonJohnson, lags = 0, type = "c") #whereas this one shows the opposite
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -1.2424
## P VALUE:
## 0.5948
##
## Description:
## Thu Mar 29 21:48:22 2018 by user: SS
The reason why adf.test often indicates stationarity (even when we clearly see on the plot that the series is trending) is that it includes a deterministic time trend in its test regression, so it effectively tests for stationarity around a trend, whereas adfTest with type = "c" includes only a constant. Thus, if a strong trend is present in a series but little or no seasonality and cycling, we may draw a wrong conclusion based on adf.test alone, and additional methods should be used.
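If the difference is indeed due to the trend term, re-running adfTest with type = "ct" (constant plus deterministic trend, matching what adf.test fits) should bring the two functions much closer into agreement. A quick check, not run above:
adfTest(AirPassengers, lags = 0, type = "ct")  #includes a deterministic time trend in the test regression
adfTest(JohnsonJohnson, lags = 0, type = "ct")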
Beyond the unit-root tests, another useful check is inspecting the ACF and PACF plots.
acf(AggMnth$cnt) #check how long a tail you get, i.e. how many lags it takes for the bars to drop below the blue significance bounds
#in this instance it is just 2 lags, so I conclude I can proceed with modelling without reworking my data
#take note of the number of lags, as you will need it in modelling
pacf(AggMnth$cnt) #check whether you are seeing some pattern here
#if there is a long tail in the ACF, or two or more high peaks repeating with some regularity in the PACF,
#you will need to work on your data further to make it suitable for time series regression modelling
#methods may include a log transformation or differencing (sketched below)
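If the ACF of the monthly series had shown a long tail, a minimal sketch of these remedies would look as follows (a log transformation to stabilise the variance, followed by first differencing to remove the trend; not applied to the data used below):
cnt_log  <- log(AggMnth$cnt)
cnt_diff <- diff(cnt_log)  #first difference removes a (log-)linear trend
acf(cnt_diff)              #the tail should die off much faster than for the raw series
adf.test(cnt_diff, alternative = "stationary", k = 0)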
Train Model
As I will build a regression model for the aggregated monthly data AggMnth, I need to bring a few more variables over from the original dataset, namely season, month, the prevailing weathersit, and the monthly means of temp, atemp, hum, and windspeed.
install.packages(pkgs="plyr", repos = "http://streaming.stat.iastate.edu/CRAN")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Warning: unable to access index for repository http://streaming.stat.iastate.edu/CRAN/src/contrib:
## cannot open URL 'http://streaming.stat.iastate.edu/CRAN/src/contrib/PACKAGES'
## Warning: package 'plyr' is not available (for R version 3.4.4)
## Warning: unable to access index for repository http://streaming.stat.iastate.edu/CRAN/bin/windows/contrib/3.4:
## cannot open URL 'http://streaming.stat.iastate.edu/CRAN/bin/windows/contrib/3.4/PACKAGES'
library(plyr)
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:timeSeries':
##
## filter, lag
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
AggMnth$season <- bike$season[match(AggMnth$ConsPer, bike$ConsPer)]
AggMnth$mnth <- bike$mnth[match(AggMnth$ConsPer, bike$ConsPer)]
#duplicate the data frame
bike1 <- as.data.frame(bike)
#add a column counting days per weathersit within each consecutive period
bike1 <- as.data.frame(ddply(bike,.(weathersit,ConsPer),transform,count = NROW(piece)))
#make a table of consecutive periods showing each weathersit and its count (frequency)
wmode <- as.data.frame(aggregate (count ~ weathersit + ConsPer, data = bike1, max))
#keep only the most frequent weathersit for each period (24 rows in total)
wmode1 <- as.data.frame(aggregate (count ~ ConsPer, data = wmode, max))
wmode1 <- wmode1 %>% left_join(wmode, by=c("ConsPer","count"))
#add the weather mode column to AggMnth, the data frame that will be used in the model
AggMnth$weathermode <- wmode1$weathersit[match(AggMnth$ConsPer, wmode1$ConsPer)]
#add remaining columns
bike_agg <- as.data.frame (aggregate(temp ~ ConsPer, bike, FUN=mean))
AggMnth$temp <- bike_agg$temp[match(AggMnth$ConsPer, bike_agg$ConsPer)]
bike_agg <- as.data.frame (aggregate(atemp ~ ConsPer, bike, FUN=mean))
AggMnth$atemp <- bike_agg$atemp[match(AggMnth$ConsPer, bike_agg$ConsPer)]
bike_agg <- as.data.frame (aggregate(hum ~ ConsPer, bike, FUN=mean))
AggMnth$hum <- bike_agg$hum[match(AggMnth$ConsPer, bike_agg$ConsPer)]
bike_agg <- as.data.frame (aggregate(windspeed ~ ConsPer, bike, FUN=mean))
AggMnth$windspeed <- bike_agg$windspeed[match(AggMnth$ConsPer, bike_agg$ConsPer)]
AggMnth$cnt <- as.numeric(AggMnth$cnt)
AggMnth$ConsPer <- as.integer(AggMnth$ConsPer)
AggMnth$cnt <- as.ts(AggMnth$cnt)
str(AggMnth)
## 'data.frame': 24 obs. of 9 variables:
## $ ConsPer : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cnt : Time-Series from 1 to 24: 38189 48215 64045 94870 135821 ...
## $ season : int 1 1 1 2 2 2 3 3 3 4 ...
## $ mnth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weathermode: int 1 1 1 2 1 1 1 1 2 1 ...
## $ temp : num 0.198 0.283 0.332 0.471 0.577 ...
## $ atemp : num 0.204 0.284 0.325 0.457 0.551 ...
## $ hum : num 0.584 0.56 0.569 0.668 0.713 ...
## $ windspeed : num 0.195 0.229 0.232 0.244 0.181 ...
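As a side note, the same monthly table could be assembled more compactly with dplyr; a sketch, assuming bike$ConsPer created above (AggMnth_alt is just an illustrative name):
AggMnth_alt <- bike %>%
  group_by(ConsPer) %>%
  summarise(cnt         = sum(cnt),
            season      = first(season),
            mnth        = first(mnth),
            weathermode = as.integer(names(which.max(table(weathersit)))),  #modal weathersit
            temp        = mean(temp),
            atemp       = mean(atemp),
            hum         = mean(hum),
            windspeed   = mean(windspeed))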
Now I will use the functions lm, tslm, and dynlm to create regression models.
install.packages("caret", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'caret' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(caret)
## Loading required package: lattice
install.packages("dynlm", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'dynlm' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(dynlm)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:timeSeries':
##
## time<-
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
install.packages("forecast", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'forecast' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(forecast)
mod1_lm<-lm(cnt ~ season + mnth + weathermode + temp + atemp +
hum + windspeed, data=AggMnth)
summary(mod1_lm)
##
## Call:
## lm(formula = cnt ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70574 -23270 1190 27984 51150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240204 205848 1.167 0.260
## season -10316 31718 -0.325 0.749
## mnth 4112 10018 0.411 0.687
## weathermode 25802 30492 0.846 0.410
## temp -1276037 1455241 -0.877 0.394
## atemp 1673208 1645334 1.017 0.324
## hum -285484 204798 -1.394 0.182
## windspeed -609904 671661 -0.908 0.377
##
## Residual standard error: 39080 on 16 degrees of freedom
## Multiple R-squared: 0.6112, Adjusted R-squared: 0.441
## F-statistic: 3.592 on 7 and 16 DF, p-value: 0.01618
#R squared adjusted 0.441
mod2_tslm <- tslm(ts(AggMnth[,2:length(AggMnth)], start=1, end=24) ~ season + mnth + weathermode +
temp + atemp + hum + windspeed, data = AggMnth)
summary(mod2_tslm)
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Response cnt :
##
## Call:
## tslm(formula = cnt ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70574 -23270 1190 27984 51150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240204 205848 1.167 0.260
## season -10316 31718 -0.325 0.749
## mnth 4112 10018 0.411 0.687
## weathermode 25802 30492 0.846 0.410
## temp -1276037 1455241 -0.877 0.394
## atemp 1673208 1645334 1.017 0.324
## hum -285484 204798 -1.394 0.182
## windspeed -609904 671661 -0.908 0.377
##
## Residual standard error: 39080 on 16 degrees of freedom
## Multiple R-squared: 0.6112, Adjusted R-squared: 0.441
## F-statistic: 3.592 on 7 and 16 DF, p-value: 0.01618
##
##
## Response season :
##
## Call:
## tslm(formula = season ~ season + mnth + weathermode + temp +
## atemp + hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.719e-16 -5.943e-17 -4.090e-18 3.081e-17 5.644e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.088e-15 8.658e-16 -1.256e+00 0.227
## season 1.000e+00 1.334e-16 7.496e+15 <2e-16 ***
## mnth 5.280e-18 4.213e-17 1.250e-01 0.902
## weathermode -1.475e-17 1.282e-16 -1.150e-01 0.910
## temp -5.970e-16 6.121e-15 -9.800e-02 0.924
## atemp 4.956e-16 6.920e-15 7.200e-02 0.944
## hum -5.959e-17 8.614e-16 -6.900e-02 0.946
## windspeed 1.685e-15 2.825e-15 5.960e-01 0.559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.644e-16 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.587e+32 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response mnth :
##
## Call:
## tslm(formula = mnth ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.830e-16 -1.231e-16 -4.330e-17 6.593e-17 1.513e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.833e-15 2.479e-15 -7.400e-01 0.47029
## season -1.297e-15 3.819e-16 -3.397e+00 0.00369 **
## mnth 1.000e+00 1.206e-16 8.290e+15 < 2e-16 ***
## weathermode -2.789e-16 3.672e-16 -7.600e-01 0.45851
## temp 1.891e-14 1.752e-14 1.079e+00 0.29644
## atemp -2.126e-14 1.981e-14 -1.073e+00 0.29919
## hum 1.275e-15 2.466e-15 5.170e-01 0.61222
## windspeed 1.282e-14 8.088e-15 1.585e+00 0.13245
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.706e-16 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.845e+32 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response weathermode :
##
## Call:
## tslm(formula = weathermode ~ season + mnth + weathermode + temp +
## atemp + hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.038e-17 -4.259e-18 1.089e-18 3.944e-18 8.017e-18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.215e-19 3.149e-17 1.000e-02 0.991980
## season -9.646e-19 4.852e-18 -1.990e-01 0.844925
## mnth 3.215e-19 1.532e-18 2.100e-01 0.836467
## weathermode 1.000e+00 4.664e-18 2.144e+17 < 2e-16 ***
## temp -9.778e-16 2.226e-16 -4.393e+00 0.000454 ***
## atemp 1.078e-15 2.517e-16 4.283e+00 0.000571 ***
## hum -4.805e-17 3.133e-17 -1.534e+00 0.144596
## windspeed -3.551e-16 1.027e-16 -3.456e+00 0.003253 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.978e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.333e+34 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response temp :
##
## Call:
## tslm(formula = temp ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.259e-16 -8.516e-18 -3.119e-18 2.186e-17 3.794e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.640e-16 2.181e-16 2.128e+00 0.0492 *
## season 1.920e-17 3.360e-17 5.710e-01 0.5757
## mnth 3.469e-18 1.061e-17 3.270e-01 0.7480
## weathermode 7.968e-18 3.230e-17 2.470e-01 0.8083
## temp 1.000e+00 1.542e-15 6.486e+14 <2e-16 ***
## atemp -3.127e-15 1.743e-15 -1.794e+00 0.0917 .
## hum -4.232e-16 2.170e-16 -1.951e+00 0.0688 .
## windspeed 9.386e-17 7.116e-16 1.320e-01 0.8967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.14e-17 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.716e+31 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response atemp :
##
## Call:
## tslm(formula = atemp ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.902e-18 -1.726e-19 2.285e-19 7.245e-19 2.693e-18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.199e-17 1.285e-17 1.711e+00 0.106
## season -1.658e-18 1.980e-18 -8.370e-01 0.415
## mnth 5.460e-19 6.254e-19 8.730e-01 0.396
## weathermode 2.906e-19 1.903e-18 1.530e-01 0.881
## temp 1.394e-16 9.085e-17 1.534e+00 0.145
## atemp 1.000e+00 1.027e-16 9.736e+15 <2e-16 ***
## hum 1.973e-17 1.278e-17 1.543e+00 0.142
## windspeed 3.302e-17 4.193e-17 7.870e-01 0.443
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.439e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.27e+34 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response hum :
##
## Call:
## tslm(formula = hum ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.504e-18 -4.225e-18 -8.109e-19 1.742e-18 2.836e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.552e-16 4.752e-17 3.266e+00 0.00486 **
## season -1.442e-18 7.322e-18 -1.970e-01 0.84636
## mnth 1.578e-18 2.313e-18 6.820e-01 0.50471
## weathermode 1.631e-17 7.039e-18 2.316e+00 0.03413 *
## temp 9.137e-17 3.360e-16 2.720e-01 0.78913
## atemp -9.451e-17 3.798e-16 -2.490e-01 0.80666
## hum 1.000e+00 4.728e-17 2.115e+16 < 2e-16 ***
## windspeed 2.219e-17 1.551e-16 1.430e-01 0.88800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.021e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.583e+32 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response windspeed :
##
## Call:
## tslm(formula = windspeed ~ season + mnth + weathermode + temp +
## atemp + hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.798e-18 -6.291e-19 7.380e-20 8.896e-19 2.369e-18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.644e-17 9.860e-18 3.696e+00 0.00196 **
## season -1.441e-18 1.519e-18 -9.480e-01 0.35703
## mnth -2.779e-19 4.799e-19 -5.790e-01 0.57056
## weathermode 4.288e-18 1.461e-18 2.936e+00 0.00970 **
## temp -2.216e-16 6.971e-17 -3.178e+00 0.00584 **
## atemp 2.451e-16 7.881e-17 3.111e+00 0.00673 **
## hum -3.172e-17 9.810e-18 -3.234e+00 0.00520 **
## windspeed 1.000e+00 3.217e-17 3.108e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.749e+32 on 7 and 16 DF, p-value: < 2.2e-16
#the repeated warnings and the perfect fits (R-squared of 1) arise because the response here is the whole ts matrix,
#so tslm fits one regression per column and each predictor ends up regressed on itself;
#only the cnt equation is meaningful, and it matches mod1_lm (a single-response call is sketched below)
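#a sketch, not part of the models above: tslm can also be called with cnt as the only response,
#which is assumed to reproduce the cnt equation (and mod1_lm) without the redundant responses
cnt_ts <- ts(AggMnth$cnt, start = 1, end = 24)
mod2b_tslm <- tslm(cnt_ts ~ season + mnth + weathermode + temp + atemp + hum + windspeed, data = AggMnth)
summary(mod2b_tslm)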
mod3_dynlm <- dynlm(cnt ~ season + mnth + weathermode +
temp + atemp + hum + windspeed +L(cnt,1) + L(cnt,2), data = AggMnth )
summary(mod3_dynlm)
##
## Time series regression with "ts" data:
## Start = 3, End = 24
##
## Call:
## dynlm(formula = cnt ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed + L(cnt, 1) + L(cnt, 2), data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12049 -7632 -3418 4505 24804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.737e+05 9.232e+04 -1.882 0.084341 .
## season -1.899e+04 1.101e+04 -1.725 0.110183
## mnth 3.990e+02 3.621e+03 0.110 0.914072
## weathermode -1.408e+04 1.090e+04 -1.292 0.220631
## temp -6.316e+05 5.374e+05 -1.175 0.262653
## atemp 7.763e+05 6.184e+05 1.255 0.233284
## hum 1.994e+05 8.471e+04 2.354 0.036434 *
## windspeed 2.974e+05 2.697e+05 1.103 0.291736
## L(cnt, 1) 1.008e+00 2.254e-01 4.472 0.000763 ***
## L(cnt, 2) 6.013e-03 2.094e-01 0.029 0.977566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12990 on 12 degrees of freedom
## Multiple R-squared: 0.9535, Adjusted R-squared: 0.9186
## F-statistic: 27.33 on 9 and 12 DF, p-value: 1.189e-06
#adjusted R-squared 0.9186
Compare fitted values with actuals for each model.
#model 1 compared to actual
fit1 <- predict(mod1_lm)
plot(AggMnth$cnt,type="l",col="black",panel.first=grid(), xaxt="n",main="Actual vs Predicted, by Month")
axis(1, at = seq(0, 24, by = 1), las=2)
lines(fit1,col="red")
#model 2 compared to actual
fit2 <- predict(mod2_tslm)
lines(fit2[,1],col="blue")
#model 3 compared to actual
fit3 <- predict(mod3_dynlm)
lines(fit3,col="yellow")
legend("bottomright",legend=c("red - Model 1, R.sq 0.441","blue - Model 2, R.sq n/a","yellow - Model 3, R.sq 0.9186"))
Compare the fitted values in a table:
comp<-data.frame(Model1=numeric(24), Model2 = numeric(24), Model3 = numeric(24))
comp$Model1<-fit1
comp$Model2 <- fit2[,1]
comp$Model3 <- c("na","na",fit3)
comp
## Model1 Model2 Model3
## 1 63659.77 63659.77 na
## 2 79961.05 79961.05 na
## 3 84795.62 84795.62 68942.3253663164
## 4 110915.44 110915.44 89454.2951277999
## 5 136779.39 136779.39 131392.495473351
## 6 189430.44 189430.44 149537.032603636
## 7 211915.37 211915.37 144686.792497106
## 8 169262.28 169262.28 146747.117514367
## 9 149954.48 149954.48 139466.829195578
## 10 124856.58 124856.58 121988.829171672
## 11 121578.69 121578.69 110248.597364227
## 12 123280.23 123280.23 74886.8661318742
## 13 68838.10 68838.10 103697.265692682
## 14 99798.71 99798.71 112059.231830227
## 15 125098.10 125098.10 140070.974558229
## 16 146853.26 146853.26 169694.051593729
## 17 153476.69 153476.69 204134.143577613
## 18 185690.25 185690.25 206320.902030303
## 19 200018.11 200018.11 198505.244006105
## 20 163352.51 163352.51 193438.508285241
## 21 174744.28 174744.28 224560.275346372
## 22 164473.43 164473.43 196264.301409918
## 23 124447.57 124447.57 160522.346956947
## 24 119498.66 119498.66 119656.574266704
Model 1 (lm) and Model 2 (tslm) produced the same predictions; Model 3 (dynlm) differs, and its first two periods have no fitted values because two lags of cnt are used as predictors.
Comparison to ARIMA
As can be seen from the chart below, the ARIMA model fits the actuals more closely than the plain lm model.
cnt <- as.ts(AggMnth$cnt)
mod4_arima<-auto.arima(cnt, seasonal=FALSE)
fit4 <-mod4_arima$fitted
fit4
## Time Series:
## Start = 1
## End = 24
## Frequency = 1
## [1] 38150.81 39765.67 53612.14 72566.51 111463.53 157865.49 147652.17
## [8] 140172.32 134187.84 122426.21 121407.81 90677.23 79332.27 101815.46
## [15] 106578.44 198109.43 179256.70 207514.65 206579.36 204025.27 220368.47
## [22] 220763.94 188218.99 127806.28
plot(AggMnth$cnt,type="l",col="black",panel.first=grid(), xaxt="n",main="Actual vs Predicted, by Month")
axis(1, at = seq(0, 24, by = 1), las=2)
lines(fit1,col="red")
lines(fit2[,1],col="blue")
lines(fit3,col="yellow")
lines(fit4,col="green")
legend("bottomright",legend=c("red - Model 1","blue - Model 2","yellow - Model 3", "green - ARIMA model (default)"))
We can also compare the residual sums of squares:
RES<-data.frame(Res_lm=numeric(24), Res_dynlm = numeric(24), Res_arima = numeric(24)) #tslm excluded: its cnt equation is identical to the lm model
RES$Res_lm<-mod1_lm$residuals
RES$Res_lm <- RES$Res_lm*RES$Res_lm
RES$Res_dynlm <- c(0,0,mod3_dynlm$residuals) #pad the two periods lost to the lags with zeros
RES$Res_dynlm <- RES$Res_dynlm * RES$Res_dynlm
RES$Res_arima <- mod4_arima$residuals
RES$Res_arima <- as.numeric(RES$Res_arima)
RES$Res_arima <- RES$Res_arima * RES$Res_arima
str(RES)
## 'data.frame': 24 obs. of 3 variables:
## $ Res_lm : num 6.49e+08 1.01e+09 4.31e+08 2.57e+08 9.19e+05 ...
## $ Res_dynlm: num 0 0 23983796 29329859 19611652 ...
## $ Res_arima: num 1.46e+03 7.14e+07 1.09e+08 4.97e+08 5.93e+08 ...
RES_comp <- as.list(c(sum(RES$Res_lm),sum(RES$Res_dynlm), sum(RES$Res_arima)))
RES_comp <- as.data.frame(RES_comp)
colnames(RES_comp) <- c("lm","dynlm","arima")
RES_comp
## lm dynlm arima
## 1 24432157031 2023722224 8412477571
apply(RES_comp,1,min) #the dynlm model had the smallest residual sum of squares, followed by ARIMA, while lm performed worst
## [1] 2023722224
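The residual sums of squares are not entirely comparable, because dynlm drops the first two periods (its column contains only 22 squared residuals plus two zeros). A scale-aware measure such as RMSE can be a fairer basis; a sketch using forecast::accuracy() on the in-sample fits created above:
accuracy(as.numeric(fit1), as.numeric(AggMnth$cnt))        #lm, 24 fitted values
accuracy(as.numeric(fit3), as.numeric(AggMnth$cnt[3:24]))  #dynlm, periods 3-24 only
accuracy(mod4_arima)                                       #ARIMA training-set accuracy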
Conclusion
Linear regression modelling is a powerful tool for exploring causal relationships between events and predicting future outcomes. When dealing with time series data it may not deliver the best predictions (compared, for example, with ARIMA or other methodologies designed specifically for time series analysis and forecasting), but it does give a better understanding of the significance of the predictors and their impact on the output. In many business contexts it may be reasonable to apply linear regression in combination with other tools.
References
[1] UCI Machine Learning Repository. Bike Sharing Dataset. https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset# Viewed March 29, 2018.
[2] Dalinina, R. Introduction to Forecasting with ARIMA in R. Contributed to DATASCIENCE.COM. https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials Viewed March 29, 2018.
[3] Nau, R. Stationarity and Differencing. Chapter from Statistical Forecasting: Notes on Regression and Time Series Analysis. https://people.duke.edu/~rnau/411diff.htm/ Viewed March 29, 2018.