TS Count data analysis

ghm=read.csv("mgh3p202122.csv",header=TRUE,sep=",")
ghm$X.3<-NULL
ghm$X.2<-NULL
ghm$X.1<-NULL
ghm$X<-NULL
ghm$Location<-NULL
ghm$Warehouse<-NULL
names(ghm)
## [1] "Places_total" "Places_free"  "Places_busy"  "date"         "yyyy"        
## [6] "month"        "dd"           "wekdays"
colnames(ghm)<-c("total","free","count","date","year","month","Monthday")
summary(ghm)
##      total          free           count               date          year     
##  Min.   :611   Min.   :122.0   Min.   :347.0   2021-01-01:  1   Min.   :2021  
##  1st Qu.:612   1st Qu.:162.0   1st Qu.:406.0   2021-01-02:  1   1st Qu.:2021  
##  Median :612   Median :183.0   Median :429.0   2021-01-03:  1   Median :2021  
##  Mean   :612   Mean   :184.3   Mean   :427.7   2021-01-04:  1   Mean   :2021  
##  3rd Qu.:612   3rd Qu.:206.0   3rd Qu.:450.0   2021-01-05:  1   3rd Qu.:2022  
##  Max.   :612   Max.   :265.0   Max.   :490.0   2021-01-06:  1   Max.   :2022  
##                                                (Other)   :662                 
##      month           Monthday            NA    
##  Min.   : 1.000   Min.   : 1.00   dimanche:95  
##  1st Qu.: 3.000   1st Qu.: 8.00   jeudi   :95  
##  Median : 6.000   Median :16.00   lundi   :95  
##  Mean   : 6.084   Mean   :15.69   mardi   :96  
##  3rd Qu.: 9.000   3rd Qu.:23.00   mercredi:96  
##  Max.   :12.000   Max.   :31.00   samedi  :95  
##                                   vendredi:96
#counts are daily date
hist(ghm$count)

mean(ghm$count)
## [1] 427.7335
var(ghm$count)
## [1] 845.5361

This data are Poisson distributed with light overdispersion. Normal distribution might not be applicable

T=ts(ghm$count,start=c(2021),freq=365)
plot(T)

#note that daily data are particluarly difficult to set a ts in R Frequency is a important parameters to bserved when decomposition algroithms are envisoned.

Looking for Seasonality

With boxplot by either month, monthlyday by year comparision may help:

Note do not confuse : cyclicity != seasonality:

Seasonality - Wikipedia

Cyclicality definition and meaning | Collins English Dictionary (collinsdictionary.com)

#Monthly Zooming
boxplot(ghm$count~ghm$month,col=rainbow(12))

boxplot(ghm$count[1:365]~ghm$month[1:365],main="Data2021",col=rainbow(12))

boxplot(ghm$count[366:668]~ghm$month[366:668],,main="Data2022",col=rainbow(12))

A rough cycle van be seen on summer-winter.

Using rolling mean can be helpfull in such situation:

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
tim=1:668
m30=rollmean(ghm$count,k=30)
plot(tim,ghm$count,type="l",lwd=2)
abline(v=c(30,60,90,120,150,180,210,240,270,300,330,360,390,420,450,480,510,540,570,600,630,660))

#The abline will give view ref.point if MAV m30 peak at data Hence m30 show a cyclicity= Here not univoquely
plot(m30,col=30,lwd=1,type="l",main="monthly MAV K=30")

#Using longer MAV give insight in a trend
#MonthlDaiy Zooming
boxplot(ghm$count~ghm$Monthday,col=rainbow(12))

boxplot(ghm$count[1:365]~ghm$Monthday[1:365],main="Data2021",col=rainbow(12))

boxplot(ghm$count[366:668]~ghm$Monthday[366:668],,main="Data2022",col=rainbow(12))

Hard to find any conclusive and affirmative seasonality & cyclicity.

Using MAV can be done again on shorter MAV resolution.

Now we should get read of the seasonal effects (Additive modeling) and left with trend and residuals.One way to doit is by removing MAV30 from the data:

#Brute substracting might not be aligned data!!!
plot(ghm$count-m30,type="l",col=2)
## Warning in ghm$count - m30: la taille d'un objet plus long n'est pas multiple de
## la taille d'un objet plus court

acf(ghm$count-m30)
## Warning in ghm$count - m30: la taille d'un objet plus long n'est pas multiple de
## la taille d'un objet plus court

It can be shown that Seasonality has not been efficiently removed simply as it is not existing.

Another way to asses this is to see the reduction in SD from series to reduced-series (Removed MAV):

sd(T)#must be = to
## [1] 29.0781
sd(ghm$count)
## [1] 29.0781
sd(m30)
## [1] 25.24538
sd(ghm$count-m30)
## Warning in ghm$count - m30: la taille d'un objet plus long n'est pas multiple de
## la taille d'un objet plus court
## [1] 17.67813

Although sd has been reduced it is not efficient : Usually after removing Seasonality (deseaoned series) a drop of more than 90 % of sd can be seen on other series (AirPasssenger i.e.)

Using decompose function might be misleading as:

-give you always a results

-depends also in the frequency of assigned series (recall that daily data show problematic behaviour in R ::Consider using ZOO of orther::)

TT=ts(ghm$count,start=c(2021),freq=12)
##Work for monthly data structure
plot(decompose(TT))

TTT=ts(ghm$count,start=c(2021),freq=4)
##Work for monthly data structure
plot(decompose(TTT))

So statistician should be aware of potential pitfall of automated algorithm (not Shown here Holt Winter fine tuning behavior of the serie).

Why it wont work?!

The data is Poisson distributed with overdispersion and the data is :
#NON STATIONARY &

#The data has a random walk behavior (not shown).

library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(ghm$count)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ghm$count
## Dickey-Fuller = -3.3346, Lag order = 8, p-value = 0.06472
## alternative hypothesis: stationary
#pval is >5% Accept data are Non-staionary=Ho
#Other test of Stationarity show pronounced NON-Stat.