ghm=read.csv("mgh3p202122.csv",header=TRUE,sep=",")
ghm$X.3<-NULL
ghm$X.2<-NULL
ghm$X.1<-NULL
ghm$X<-NULL
ghm$Location<-NULL
ghm$Warehouse<-NULL
names(ghm)
## [1] "Places_total" "Places_free" "Places_busy" "date" "yyyy"
## [6] "month" "dd" "wekdays"
colnames(ghm)<-c("total","free","count","date","year","month","Monthday")
summary(ghm)
## total free count date year
## Min. :611 Min. :122.0 Min. :347.0 2021-01-01: 1 Min. :2021
## 1st Qu.:612 1st Qu.:162.0 1st Qu.:406.0 2021-01-02: 1 1st Qu.:2021
## Median :612 Median :183.0 Median :429.0 2021-01-03: 1 Median :2021
## Mean :612 Mean :184.3 Mean :427.7 2021-01-04: 1 Mean :2021
## 3rd Qu.:612 3rd Qu.:206.0 3rd Qu.:450.0 2021-01-05: 1 3rd Qu.:2022
## Max. :612 Max. :265.0 Max. :490.0 2021-01-06: 1 Max. :2022
## (Other) :662
## month Monthday NA
## Min. : 1.000 Min. : 1.00 dimanche:95
## 1st Qu.: 3.000 1st Qu.: 8.00 jeudi :95
## Median : 6.000 Median :16.00 lundi :95
## Mean : 6.084 Mean :15.69 mardi :96
## 3rd Qu.: 9.000 3rd Qu.:23.00 mercredi:96
## Max. :12.000 Max. :31.00 samedi :95
## vendredi:96
#counts are daily date
hist(ghm$count)
mean(ghm$count)
## [1] 427.7335
var(ghm$count)
## [1] 845.5361
This data are Poisson distributed with light overdispersion. Normal distribution might not be applicable
T=ts(ghm$count,start=c(2021),freq=365)
plot(T)
#note that daily data are particluarly difficult to set a ts in R Frequency is a important parameters to bserved when decomposition algroithms are envisoned.
With boxplot by either month, monthlyday by year comparision may help:
Note do not confuse : cyclicity != seasonality:
Cyclicality definition and meaning | Collins English Dictionary (collinsdictionary.com)
#Monthly Zooming
boxplot(ghm$count~ghm$month,col=rainbow(12))
boxplot(ghm$count[1:365]~ghm$month[1:365],main="Data2021",col=rainbow(12))
boxplot(ghm$count[366:668]~ghm$month[366:668],,main="Data2022",col=rainbow(12))
A rough cycle van be seen on summer-winter.
Using rolling mean can be helpfull in such situation:
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
tim=1:668
m30=rollmean(ghm$count,k=30)
plot(tim,ghm$count,type="l",lwd=2)
abline(v=c(30,60,90,120,150,180,210,240,270,300,330,360,390,420,450,480,510,540,570,600,630,660))
#The abline will give view ref.point if MAV m30 peak at data Hence m30 show a cyclicity= Here not univoquely
plot(m30,col=30,lwd=1,type="l",main="monthly MAV K=30")
#Using longer MAV give insight in a trend
#MonthlDaiy Zooming
boxplot(ghm$count~ghm$Monthday,col=rainbow(12))
boxplot(ghm$count[1:365]~ghm$Monthday[1:365],main="Data2021",col=rainbow(12))
boxplot(ghm$count[366:668]~ghm$Monthday[366:668],,main="Data2022",col=rainbow(12))
Hard to find any conclusive and affirmative seasonality & cyclicity.
Using MAV can be done again on shorter MAV resolution.
Now we should get read of the seasonal effects (Additive modeling) and left with trend and residuals.One way to doit is by removing MAV30 from the data:
#Brute substracting might not be aligned data!!!
plot(ghm$count-m30,type="l",col=2)
## Warning in ghm$count - m30: la taille d'un objet plus long n'est pas multiple de
## la taille d'un objet plus court
acf(ghm$count-m30)
## Warning in ghm$count - m30: la taille d'un objet plus long n'est pas multiple de
## la taille d'un objet plus court
It can be shown that Seasonality has not been efficiently removed simply as it is not existing.
Another way to asses this is to see the reduction in SD from series to reduced-series (Removed MAV):
sd(T)#must be = to
## [1] 29.0781
sd(ghm$count)
## [1] 29.0781
sd(m30)
## [1] 25.24538
sd(ghm$count-m30)
## Warning in ghm$count - m30: la taille d'un objet plus long n'est pas multiple de
## la taille d'un objet plus court
## [1] 17.67813
Although sd has been reduced it is not efficient : Usually after removing Seasonality (deseaoned series) a drop of more than 90 % of sd can be seen on other series (AirPasssenger i.e.)
Using decompose function might be misleading as:
-give you always a results
-depends also in the frequency of assigned series (recall that daily data show problematic behaviour in R ::Consider using ZOO of orther::)
TT=ts(ghm$count,start=c(2021),freq=12)
##Work for monthly data structure
plot(decompose(TT))
TTT=ts(ghm$count,start=c(2021),freq=4)
##Work for monthly data structure
plot(decompose(TTT))
So statistician should be aware of potential pitfall of automated algorithm (not Shown here Holt Winter fine tuning behavior of the serie).
The data is Poisson distributed with overdispersion and the data is
:
#NON STATIONARY &
#The data has a random walk behavior (not shown).
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(ghm$count)
##
## Augmented Dickey-Fuller Test
##
## data: ghm$count
## Dickey-Fuller = -3.3346, Lag order = 8, p-value = 0.06472
## alternative hypothesis: stationary
#pval is >5% Accept data are Non-staionary=Ho
#Other test of Stationarity show pronounced NON-Stat.