In any business environment, time series data is of core importance: it is what a company uses to describe its actual performance in the past and what it can use to predict the future.
In this vignette I provide examples of techniques and functions in R that can be used to analyse time series data and make predictions based on known actuals.
Data
The dataset used to create this vignette contains daily statistics on bike hiring for 2011-2012 and was taken from the open UCI Machine Learning Repository [1]. The same dataset was used in a very good tutorial on ARIMA modelling in R; ARIMA is not the focus of this vignette, but the method is interesting and you can go through that tutorial here: https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials [2]. To provide some examples I also used datasets built into base R (e.g. austres, AirPassengers, LakeHuron).
Beginning: Visualise and Explore
The first step (after loading the data) is to visualise it on a plot - this gives insights into the patterns and trends in the data that are important to take into account in later modelling.
setwd("C:/Users/SS/Desktop/UTS/36103_Statistical_Thinking_for_Data_Science/R_wd")
bike <- read.csv("C:/Users/SS/Desktop/UTS/36103_Statistical_Thinking_for_Data_Science/R_wd/day.csv", header=TRUE, stringsAsFactors=FALSE)
head(bike)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
Plot bike hiring count as a line plot.
bike$date <- as.Date(bike$dteday)
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(ggplot2)
ggplot(bike, aes(date, cnt)) + geom_line() + ylab("Daily Bike Hiring Count") +
xlab("date")
The data looks quite volatile, with distinct dips/outliers; it seemed to be on a growing trend until October 2012, but then it started to decline. We may look at box plots for a better understanding, as it is often easier to see seasonality patterns or cycles in a box plot.
ggplot(bike, aes(y=cnt, x = weekday)) + geom_boxplot(aes(group=weekday)) + scale_x_continuous(breaks=seq(0,7,1)) + ggtitle("Bike Hiring Count by Day")
ggplot(bike, aes(y=cnt, x = mnth)) + geom_boxplot(aes(group=mnth)) + scale_x_continuous(breaks=seq(0,12,1), labels=c("","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) + ggtitle("Bike Hiring Count by Month")
One of the essential criteria for time series data to be usable in a regression model is stationarity, which implies that neither the mean, nor the variance, nor the autocorrelation is a function of time (e.g. [3]). When we look at the graphs we can often see whether there is a trend or seasonal cycling in the data, but running a Dickey-Fuller test is a more reliable way to answer the question about stationarity. There are at least two functions known in R that we may use to run the Dickey-Fuller test: adf.test from the tseries package and adfTest from fUnitRoots. Let's try them both and compare results.
install.packages("tseries", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'tseries' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(tseries)
adf.test(bike$cnt, alternative = "stationary", k=0)
## Warning in adf.test(bike$cnt, alternative = "stationary", k = 0): p-value
## smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: bike$cnt
## Dickey-Fuller = -10.052, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
install.packages("fUnitRoots", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'fUnitRoots' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(fUnitRoots)
## Loading required package: timeDate
## Loading required package: timeSeries
## Loading required package: fBasics
adfTest(bike$cnt, lags = 0, type = "c") #this function requires a type argument - "c", "ct", or "nc"; here we use the default "c", meaning a constant (intercept) is included but no time trend
## Warning in adfTest(bike$cnt, lags = 0, type = "c"): p-value smaller than
## printed p-value
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -7.8083
## P VALUE:
## 0.01
##
## Description:
## Thu Mar 29 21:48:21 2018 by user: SS
Both tests returned a p-value below 0.01, so we can reject the null hypothesis of a unit root and conclude that the daily series is stationary.
This could be expected, as daily data normally contains a lot of noise; we might get different results if we run the tests on longer periods (e.g. weeks or months) or on moving averages rather than on the daily actuals.
bike$ConsPer <- ifelse(bike$yr == 1, 12,bike$yr) #yr is coded 0 for 2011 and 1 for 2012, so this gives an offset of 0 or 12
bike$ConsPer <- bike$ConsPer + bike$mnth #consecutive month index, 1-24 across the two years
AggMnth <- aggregate (cnt ~ ConsPer, bike, FUN = sum) #total hiring count per month
adf.test(AggMnth$cnt, alternative = "stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: AggMnth$cnt
## Dickey-Fuller = -0.68964, Lag order = 0, p-value = 0.9592
## alternative hypothesis: stationary
adfTest(AggMnth$cnt, lags = 0, type = "c")
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -2.0078
## P VALUE:
## 0.3202
##
## Description:
## Thu Mar 29 21:48:22 2018 by user: SS
In this case the results suggest that the null hypothesis cannot be rejected with confidence, so the monthly data may not be stationary.
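To see more concretely what the test detects, here is a quick illustration with simulated data (a sketch; it assumes tseries is already loaded): white noise is stationary by construction, while a random walk, whose mean wanders over time, is not.
set.seed(1)
white_noise <- rnorm(200)          #stationary by construction
random_walk <- cumsum(rnorm(200))  #contains a unit root; the mean drifts over time
adf.test(white_noise, alternative = "stationary", k = 0)  #expect a small p-value (reject unit root)
adf.test(random_walk, alternative = "stationary", k = 0)  #expect a large p-value (cannot reject unit root)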
Note that in some cases the functions adf.test and adfTest may produce different results that can even seem contradictory, e.g.:
plot(AirPassengers) #this series demonstrates a distinct growing trend and seasonality
adf.test(AirPassengers, alternative="stationary", k=0) #however adf.test from tseries suggests the data is stationary
## Warning in adf.test(AirPassengers, alternative = "stationary", k = 0): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: AirPassengers
## Dickey-Fuller = -4.6392, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
adfTest(AirPassengers, lags = 0, type = "c") #but this function suggests the opposite
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -1.7481
## P VALUE:
## 0.4075
##
## Description:
## Thu Mar 29 21:48:22 2018 by user: SS
#the same happens here, with the line plot of quarterly earnings per Johnson & Johnson share
plot(JohnsonJohnson) #a distinct growing trend, possibly with some seasonality
adf.test(JohnsonJohnson, alternative="stationary", k=0) #yet adf.test again indicates stationarity
## Warning in adf.test(JohnsonJohnson, alternative = "stationary", k = 0): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: JohnsonJohnson
## Dickey-Fuller = -4.4474, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
adfTest(JohnsonJohnson, lags = 0, type = "c") #whereas this one shows the opposite
##
## Title:
## Augmented Dickey-Fuller Test
##
## Test Results:
## PARAMETER:
## Lag Order: 0
## STATISTIC:
## Dickey-Fuller: -1.2424
## P VALUE:
## 0.5948
##
## Description:
## Thu Mar 29 21:48:22 2018 by user: SS
The reason why adf.test often indicates stationarity (even when we clearly see on the plot that the series is trending) is that it includes a deterministic time trend in its test regression, so it effectively tests for stationarity around a trend, whereas adfTest with type = "c" includes only a constant. Thus, if a strong trend is present in a series but little or no seasonality and cycling, we may draw a wrong conclusion based on adf.test alone, and additional methods should be used.
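If the difference is indeed due to the trend term, re-running adfTest with type = "ct" (constant plus deterministic trend, matching what adf.test fits) should bring the two functions much closer into agreement. A quick check, not run above:
adfTest(AirPassengers, lags = 0, type = "ct")  #includes a deterministic time trend in the test regression
adfTest(JohnsonJohnson, lags = 0, type = "ct")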
Beyond the unit-root tests, another useful check is inspecting the ACF and PACF plots.
acf(AggMnth$cnt) #check how long a tail you get, i.e. how many lags it takes for the bars to drop below the blue significance bounds
#in this instance it is just 2 lags, so I conclude I can proceed with modelling without reworking my data
#take note of the number of lags, as you will need it in modelling
pacf(AggMnth$cnt) #check whether you are seeing some pattern here
#if there is a long tail in the ACF, or two or more high peaks repeating with some regularity in the PACF,
#you will need to work on your data further to make it suitable for time series regression modelling
#methods may include a log transformation or differencing (sketched below)
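If the ACF of the monthly series had shown a long tail, a minimal sketch of these remedies would look as follows (a log transformation to stabilise the variance, followed by first differencing to remove the trend; not applied to the data used below):
cnt_log  <- log(AggMnth$cnt)
cnt_diff <- diff(cnt_log)  #first difference removes a (log-)linear trend
acf(cnt_diff)              #the tail should die off much faster than for the raw series
adf.test(cnt_diff, alternative = "stationary", k = 0)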
Train Model
As I will build a regression model for the aggregated monthly data AggMnth, I need to bring a few more variables over from the original dataset, namely season, month, the prevailing weathersit, and the monthly means of temp, atemp, hum, and windspeed.
install.packages(pkgs="plyr", repos = "http://streaming.stat.iastate.edu/CRAN")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## Warning: unable to access index for repository http://streaming.stat.iastate.edu/CRAN/src/contrib:
## cannot open URL 'http://streaming.stat.iastate.edu/CRAN/src/contrib/PACKAGES'
## Warning: package 'plyr' is not available (for R version 3.4.4)
## Warning: unable to access index for repository http://streaming.stat.iastate.edu/CRAN/bin/windows/contrib/3.4:
## cannot open URL 'http://streaming.stat.iastate.edu/CRAN/bin/windows/contrib/3.4/PACKAGES'
library(plyr)
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:timeSeries':
##
## filter, lag
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
AggMnth$season <- bike$season[match(AggMnth$ConsPer, bike$ConsPer)]
AggMnth$mnth <- bike$mnth[match(AggMnth$ConsPer, bike$ConsPer)]
#duplicate the data frame
bike1 <- as.data.frame(bike)
#add a column counting days per weathersit within each consecutive period
bike1 <- as.data.frame(ddply(bike,.(weathersit,ConsPer),transform,count = NROW(piece)))
#make a table of consecutive periods showing each weathersit and its count (frequency)
wmode <- as.data.frame(aggregate (count ~ weathersit + ConsPer, data = bike1, max))
#keep only the most frequent weathersit for each period (24 rows in total)
wmode1 <- as.data.frame(aggregate (count ~ ConsPer, data = wmode, max))
wmode1 <- wmode1 %>% left_join(wmode, by=c("ConsPer","count"))
#add the weather mode column to AggMnth, the data frame that will be used in the model
AggMnth$weathermode <- wmode1$weathersit[match(AggMnth$ConsPer, wmode1$ConsPer)]
#add remaining columns
bike_agg <- as.data.frame (aggregate(temp ~ ConsPer, bike, FUN=mean))
AggMnth$temp <- bike_agg$temp[match(AggMnth$ConsPer, bike_agg$ConsPer)]
bike_agg <- as.data.frame (aggregate(atemp ~ ConsPer, bike, FUN=mean))
AggMnth$atemp <- bike_agg$atemp[match(AggMnth$ConsPer, bike_agg$ConsPer)]
bike_agg <- as.data.frame (aggregate(hum ~ ConsPer, bike, FUN=mean))
AggMnth$hum <- bike_agg$hum[match(AggMnth$ConsPer, bike_agg$ConsPer)]
bike_agg <- as.data.frame (aggregate(windspeed ~ ConsPer, bike, FUN=mean))
AggMnth$windspeed <- bike_agg$windspeed[match(AggMnth$ConsPer, bike_agg$ConsPer)]
AggMnth$cnt <- as.numeric(AggMnth$cnt)
AggMnth$ConsPer <- as.integer(AggMnth$ConsPer)
AggMnth$cnt <- as.ts(AggMnth$cnt)
str(AggMnth)
## 'data.frame': 24 obs. of 9 variables:
## $ ConsPer : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cnt : Time-Series from 1 to 24: 38189 48215 64045 94870 135821 ...
## $ season : int 1 1 1 2 2 2 3 3 3 4 ...
## $ mnth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weathermode: int 1 1 1 2 1 1 1 1 2 1 ...
## $ temp : num 0.198 0.283 0.332 0.471 0.577 ...
## $ atemp : num 0.204 0.284 0.325 0.457 0.551 ...
## $ hum : num 0.584 0.56 0.569 0.668 0.713 ...
## $ windspeed : num 0.195 0.229 0.232 0.244 0.181 ...
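As a side note, the same monthly table could be assembled more compactly with dplyr; a sketch, assuming bike$ConsPer created above (AggMnth_alt is just an illustrative name):
AggMnth_alt <- bike %>%
  group_by(ConsPer) %>%
  summarise(cnt         = sum(cnt),
            season      = first(season),
            mnth        = first(mnth),
            weathermode = as.integer(names(which.max(table(weathersit)))),  #modal weathersit
            temp        = mean(temp),
            atemp       = mean(atemp),
            hum         = mean(hum),
            windspeed   = mean(windspeed))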
Now I will use the functions lm, tslm, and dynlm to create regression models.
install.packages("caret", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'caret' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(caret)
## Loading required package: lattice
install.packages("dynlm", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'dynlm' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(dynlm)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:timeSeries':
##
## time<-
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
install.packages("forecast", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/SS/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'forecast' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\SS\AppData\Local\Temp\RtmpGQybt3\downloaded_packages
library(forecast)
mod1_lm<-lm(cnt ~ season + mnth + weathermode + temp + atemp +
hum + windspeed, data=AggMnth)
summary(mod1_lm)
##
## Call:
## lm(formula = cnt ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70574 -23270 1190 27984 51150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240204 205848 1.167 0.260
## season -10316 31718 -0.325 0.749
## mnth 4112 10018 0.411 0.687
## weathermode 25802 30492 0.846 0.410
## temp -1276037 1455241 -0.877 0.394
## atemp 1673208 1645334 1.017 0.324
## hum -285484 204798 -1.394 0.182
## windspeed -609904 671661 -0.908 0.377
##
## Residual standard error: 39080 on 16 degrees of freedom
## Multiple R-squared: 0.6112, Adjusted R-squared: 0.441
## F-statistic: 3.592 on 7 and 16 DF, p-value: 0.01618
#R squared adjusted 0.441
mod2_tslm <- tslm(ts(AggMnth[,2:length(AggMnth)], start=1, end=24) ~ season + mnth + weathermode +
temp + atemp + hum + windspeed, data = AggMnth)
summary(mod2_tslm)
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Response cnt :
##
## Call:
## tslm(formula = cnt ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70574 -23270 1190 27984 51150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240204 205848 1.167 0.260
## season -10316 31718 -0.325 0.749
## mnth 4112 10018 0.411 0.687
## weathermode 25802 30492 0.846 0.410
## temp -1276037 1455241 -0.877 0.394
## atemp 1673208 1645334 1.017 0.324
## hum -285484 204798 -1.394 0.182
## windspeed -609904 671661 -0.908 0.377
##
## Residual standard error: 39080 on 16 degrees of freedom
## Multiple R-squared: 0.6112, Adjusted R-squared: 0.441
## F-statistic: 3.592 on 7 and 16 DF, p-value: 0.01618
##
##
## Response season :
##
## Call:
## tslm(formula = season ~ season + mnth + weathermode + temp +
## atemp + hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.719e-16 -5.943e-17 -4.090e-18 3.081e-17 5.644e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.088e-15 8.658e-16 -1.256e+00 0.227
## season 1.000e+00 1.334e-16 7.496e+15 <2e-16 ***
## mnth 5.280e-18 4.213e-17 1.250e-01 0.902
## weathermode -1.475e-17 1.282e-16 -1.150e-01 0.910
## temp -5.970e-16 6.121e-15 -9.800e-02 0.924
## atemp 4.956e-16 6.920e-15 7.200e-02 0.944
## hum -5.959e-17 8.614e-16 -6.900e-02 0.946
## windspeed 1.685e-15 2.825e-15 5.960e-01 0.559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.644e-16 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.587e+32 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response mnth :
##
## Call:
## tslm(formula = mnth ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.830e-16 -1.231e-16 -4.330e-17 6.593e-17 1.513e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.833e-15 2.479e-15 -7.400e-01 0.47029
## season -1.297e-15 3.819e-16 -3.397e+00 0.00369 **
## mnth 1.000e+00 1.206e-16 8.290e+15 < 2e-16 ***
## weathermode -2.789e-16 3.672e-16 -7.600e-01 0.45851
## temp 1.891e-14 1.752e-14 1.079e+00 0.29644
## atemp -2.126e-14 1.981e-14 -1.073e+00 0.29919
## hum 1.275e-15 2.466e-15 5.170e-01 0.61222
## windspeed 1.282e-14 8.088e-15 1.585e+00 0.13245
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.706e-16 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.845e+32 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response weathermode :
##
## Call:
## tslm(formula = weathermode ~ season + mnth + weathermode + temp +
## atemp + hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.038e-17 -4.259e-18 1.089e-18 3.944e-18 8.017e-18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.215e-19 3.149e-17 1.000e-02 0.991980
## season -9.646e-19 4.852e-18 -1.990e-01 0.844925
## mnth 3.215e-19 1.532e-18 2.100e-01 0.836467
## weathermode 1.000e+00 4.664e-18 2.144e+17 < 2e-16 ***
## temp -9.778e-16 2.226e-16 -4.393e+00 0.000454 ***
## atemp 1.078e-15 2.517e-16 4.283e+00 0.000571 ***
## hum -4.805e-17 3.133e-17 -1.534e+00 0.144596
## windspeed -3.551e-16 1.027e-16 -3.456e+00 0.003253 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.978e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.333e+34 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response temp :
##
## Call:
## tslm(formula = temp ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.259e-16 -8.516e-18 -3.119e-18 2.186e-17 3.794e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.640e-16 2.181e-16 2.128e+00 0.0492 *
## season 1.920e-17 3.360e-17 5.710e-01 0.5757
## mnth 3.469e-18 1.061e-17 3.270e-01 0.7480
## weathermode 7.968e-18 3.230e-17 2.470e-01 0.8083
## temp 1.000e+00 1.542e-15 6.486e+14 <2e-16 ***
## atemp -3.127e-15 1.743e-15 -1.794e+00 0.0917 .
## hum -4.232e-16 2.170e-16 -1.951e+00 0.0688 .
## windspeed 9.386e-17 7.116e-16 1.320e-01 0.8967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.14e-17 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.716e+31 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response atemp :
##
## Call:
## tslm(formula = atemp ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.902e-18 -1.726e-19 2.285e-19 7.245e-19 2.693e-18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.199e-17 1.285e-17 1.711e+00 0.106
## season -1.658e-18 1.980e-18 -8.370e-01 0.415
## mnth 5.460e-19 6.254e-19 8.730e-01 0.396
## weathermode 2.906e-19 1.903e-18 1.530e-01 0.881
## temp 1.394e-16 9.085e-17 1.534e+00 0.145
## atemp 1.000e+00 1.027e-16 9.736e+15 <2e-16 ***
## hum 1.973e-17 1.278e-17 1.543e+00 0.142
## windspeed 3.302e-17 4.193e-17 7.870e-01 0.443
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.439e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.27e+34 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response hum :
##
## Call:
## tslm(formula = hum ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.504e-18 -4.225e-18 -8.109e-19 1.742e-18 2.836e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.552e-16 4.752e-17 3.266e+00 0.00486 **
## season -1.442e-18 7.322e-18 -1.970e-01 0.84636
## mnth 1.578e-18 2.313e-18 6.820e-01 0.50471
## weathermode 1.631e-17 7.039e-18 2.316e+00 0.03413 *
## temp 9.137e-17 3.360e-16 2.720e-01 0.78913
## atemp -9.451e-17 3.798e-16 -2.490e-01 0.80666
## hum 1.000e+00 4.728e-17 2.115e+16 < 2e-16 ***
## windspeed 2.219e-17 1.551e-16 1.430e-01 0.88800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.021e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.583e+32 on 7 and 16 DF, p-value: < 2.2e-16
##
##
## Response windspeed :
##
## Call:
## tslm(formula = windspeed ~ season + mnth + weathermode + temp +
## atemp + hum + windspeed, data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.798e-18 -6.291e-19 7.380e-20 8.896e-19 2.369e-18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.644e-17 9.860e-18 3.696e+00 0.00196 **
## season -1.441e-18 1.519e-18 -9.480e-01 0.35703
## mnth -2.779e-19 4.799e-19 -5.790e-01 0.57056
## weathermode 4.288e-18 1.461e-18 2.936e+00 0.00970 **
## temp -2.216e-16 6.971e-17 -3.178e+00 0.00584 **
## atemp 2.451e-16 7.881e-17 3.111e+00 0.00673 **
## hum -3.172e-17 9.810e-18 -3.234e+00 0.00520 **
## windspeed 1.000e+00 3.217e-17 3.108e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872e-18 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.749e+32 on 7 and 16 DF, p-value: < 2.2e-16
#the repeated warnings and the perfect fits (R-squared of 1) arise because the response here is the whole ts matrix,
#so tslm fits one regression per column and each predictor ends up regressed on itself;
#only the cnt equation is meaningful, and it matches mod1_lm (a single-response call is sketched below)
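#a sketch, not part of the models above: tslm can also be called with cnt as the only response,
#which is assumed to reproduce the cnt equation (and mod1_lm) without the redundant responses
cnt_ts <- ts(AggMnth$cnt, start = 1, end = 24)
mod2b_tslm <- tslm(cnt_ts ~ season + mnth + weathermode + temp + atemp + hum + windspeed, data = AggMnth)
summary(mod2b_tslm)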
mod3_dynlm <- dynlm(cnt ~ season + mnth + weathermode +
temp + atemp + hum + windspeed +L(cnt,1) + L(cnt,2), data = AggMnth )
summary(mod3_dynlm)
##
## Time series regression with "ts" data:
## Start = 3, End = 24
##
## Call:
## dynlm(formula = cnt ~ season + mnth + weathermode + temp + atemp +
## hum + windspeed + L(cnt, 1) + L(cnt, 2), data = AggMnth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12049 -7632 -3418 4505 24804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.737e+05 9.232e+04 -1.882 0.084341 .
## season -1.899e+04 1.101e+04 -1.725 0.110183
## mnth 3.990e+02 3.621e+03 0.110 0.914072
## weathermode -1.408e+04 1.090e+04 -1.292 0.220631
## temp -6.316e+05 5.374e+05 -1.175 0.262653
## atemp 7.763e+05 6.184e+05 1.255 0.233284
## hum 1.994e+05 8.471e+04 2.354 0.036434 *
## windspeed 2.974e+05 2.697e+05 1.103 0.291736
## L(cnt, 1) 1.008e+00 2.254e-01 4.472 0.000763 ***
## L(cnt, 2) 6.013e-03 2.094e-01 0.029 0.977566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12990 on 12 degrees of freedom
## Multiple R-squared: 0.9535, Adjusted R-squared: 0.9186
## F-statistic: 27.33 on 9 and 12 DF, p-value: 1.189e-06
#adjusted R-squared 0.9186
Compare fitted values with actuals for each model.
#model 1 compared to actual
fit1 <- predict(mod1_lm)
plot(AggMnth$cnt,type="l",col="black",panel.first=grid(), xaxt="n",main="Actual vs Predicted, by Month")
axis(1, at = seq(0, 24, by = 1), las=2)
lines(fit1,col="red")
#model 2 compared to actual
fit2 <- predict(mod2_tslm)
lines(fit2[,1],col="blue")
#model 3 compared to actual
fit3 <- predict(mod3_dynlm)
lines(fit3,col="yellow")
legend("bottomright",legend=c("red - Model 1, R.sq 0.441","blue - Model 2, R.sq n/a","yellow - Model 3, R.sq 0.9186"))
Compare the fitted values in a table:
comp<-data.frame(Model1=numeric(24), Model2 = numeric(24), Model3 = numeric(24))
comp$Model1<-fit1
comp$Model2 <- fit2[,1]
comp$Model3 <- c("na","na",fit3)
comp
## Model1 Model2 Model3
## 1 63659.77 63659.77 na
## 2 79961.05 79961.05 na
## 3 84795.62 84795.62 68942.3253663164
## 4 110915.44 110915.44 89454.2951277999
## 5 136779.39 136779.39 131392.495473351
## 6 189430.44 189430.44 149537.032603636
## 7 211915.37 211915.37 144686.792497106
## 8 169262.28 169262.28 146747.117514367
## 9 149954.48 149954.48 139466.829195578
## 10 124856.58 124856.58 121988.829171672
## 11 121578.69 121578.69 110248.597364227
## 12 123280.23 123280.23 74886.8661318742
## 13 68838.10 68838.10 103697.265692682
## 14 99798.71 99798.71 112059.231830227
## 15 125098.10 125098.10 140070.974558229
## 16 146853.26 146853.26 169694.051593729
## 17 153476.69 153476.69 204134.143577613
## 18 185690.25 185690.25 206320.902030303
## 19 200018.11 200018.11 198505.244006105
## 20 163352.51 163352.51 193438.508285241
## 21 174744.28 174744.28 224560.275346372
## 22 164473.43 164473.43 196264.301409918
## 23 124447.57 124447.57 160522.346956947
## 24 119498.66 119498.66 119656.574266704
Model 1 (lm) and Model 2 (tslm) produced the same predictions; Model 3 (dynlm) differs, and its first two periods have no fitted values because two lags of cnt are used as predictors.
Comparison to ARIMA
As can be seen from the chart below, the ARIMA model fits the actuals more closely than the plain lm model.
cnt <- as.ts(AggMnth$cnt)
mod4_arima<-auto.arima(cnt, seasonal=FALSE)
fit4 <-mod4_arima$fitted
fit4
## Time Series:
## Start = 1
## End = 24
## Frequency = 1
## [1] 38150.81 39765.67 53612.14 72566.51 111463.53 157865.49 147652.17
## [8] 140172.32 134187.84 122426.21 121407.81 90677.23 79332.27 101815.46
## [15] 106578.44 198109.43 179256.70 207514.65 206579.36 204025.27 220368.47
## [22] 220763.94 188218.99 127806.28
plot(AggMnth$cnt,type="l",col="black",panel.first=grid(), xaxt="n",main="Actual vs Predicted, by Month")
axis(1, at = seq(0, 24, by = 1), las=2)
lines(fit1,col="red")
lines(fit2[,1],col="blue")
lines(fit3,col="yellow")
lines(fit4,col="green")
legend("bottomright",legend=c("red - Model 1","blue - Model 2","yellow - Model 3", "green - ARIMA model (default)"))
We can also compare the residual sums of squares:
RES<-data.frame(Res_lm=numeric(24), Res_dynlm = numeric(24), Res_arima = numeric(24)) #tslm excluded: its cnt equation is identical to the lm model
RES$Res_lm<-mod1_lm$residuals
RES$Res_lm <- RES$Res_lm*RES$Res_lm
RES$Res_dynlm <- c(0,0,mod3_dynlm$residuals) #pad the two periods lost to the lags with zeros
RES$Res_dynlm <- RES$Res_dynlm * RES$Res_dynlm
RES$Res_arima <- mod4_arima$residuals
RES$Res_arima <- as.numeric(RES$Res_arima)
RES$Res_arima <- RES$Res_arima * RES$Res_arima
str(RES)
## 'data.frame': 24 obs. of 3 variables:
## $ Res_lm : num 6.49e+08 1.01e+09 4.31e+08 2.57e+08 9.19e+05 ...
## $ Res_dynlm: num 0 0 23983796 29329859 19611652 ...
## $ Res_arima: num 1.46e+03 7.14e+07 1.09e+08 4.97e+08 5.93e+08 ...
RES_comp <- as.list(c(sum(RES$Res_lm),sum(RES$Res_dynlm), sum(RES$Res_arima)))
RES_comp <- as.data.frame(RES_comp)
colnames(RES_comp) <- c("lm","dynlm","arima")
RES_comp
## lm dynlm arima
## 1 24432157031 2023722224 8412477571
apply(RES_comp,1,min) #the dynlm model had the smallest residual sum of squares, followed by ARIMA, while lm performed worst
## [1] 2023722224
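The residual sums of squares are not entirely comparable, because dynlm drops the first two periods (its column contains only 22 squared residuals plus two zeros). A scale-aware measure such as RMSE can be a fairer basis; a sketch using forecast::accuracy() on the in-sample fits created above:
accuracy(as.numeric(fit1), as.numeric(AggMnth$cnt))        #lm, 24 fitted values
accuracy(as.numeric(fit3), as.numeric(AggMnth$cnt[3:24]))  #dynlm, periods 3-24 only
accuracy(mod4_arima)                                       #ARIMA training-set accuracy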
Conclusion
Linear regression modelling is a powerful tool for exploring causal relationships between events and predicting future outcomes. When dealing with time series data it may not deliver the best predictions (compared, for example, with ARIMA or other methodologies designed specifically for time series analysis and forecasting), but it does give a better understanding of the significance of the predictors and their impact on the output. In many business contexts it may be reasonable to apply linear regression in combination with other tools.
References
[1] UCI Machine Learning Repository. Bike Sharing Dataset. https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset# Viewed March 29, 2018.
[2] Dalinina, R. Introduction to Forecasting with ARIMA in R. Contributed to DATASCIENCE.COM. https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials Viewed March 29, 2018.
[3] Nau, R. Stationarity and Differencing. Chapter from Statistical Forecasting: Notes on Regression and Time Series Analysis. https://people.duke.edu/~rnau/411diff.htm/ Viewed March 29, 2018.