Summary of data:

happytails.ts = ts(happytails$Attendance, start=c(2012,1), end= c(2017,12), freq = 12)

summary(happytails$Attendance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.00   70.00   83.00   81.64   95.00  151.00       2

boxplot(happytails$Attendance, horizontal= TRUE)

Manually plotting the data and then using autoplot() to see the difference:

happytails.ts = ts(happytails$Attendance, start=c(2012,1), end= c(2017,12), freq = 12)

plot(happytails.ts, xlab="Year", ylab= "Attendance", ylim= c(0,200), bty= "l")

autoplot(happytails.ts)

There definitely appears to be some sort of pattern here, though it doesn’t appear to be seasonal as the pattern spans less than a year.

Let’s look at just the first 3 years of data to see the pattern more clearly:

#Plot first 3 years

ht_first3years= window(happytails.ts, 2012,c(2014,12))
plot(ht_first3years, xlab="Year", ylab= "Attendance", ylim= c(0,200), bty= "l")

autoplot(ht_first3years, main= "Autoplot")

We can more clearly see almost an “M” shape in the data, but again, it’s not on a yearly cycle.

Changing to a log scale to see if there’s any difference:

#no log scale
plot(happytails.ts, xlab="Year", ylab= "Attendance", ylim= c(0,200), bty= "l")

#log scale
plot(happytails.ts, log= "y", xlab="Year", ylab= "Attendance (log scale)", ylim= c(10,200), bty= "l", main= "Using Log Scale")

The data looks the same as when we manually plotted it.

Fitting a linear trend line to the data:

ht_linear_trend= tslm(happytails.ts ~ trend)

#plot linear line
plot(happytails.ts, xlab="Year", ylab= "Attendance", ylim= c(0,200), bty= "l")
lines(ht_linear_trend$fitted, lwd=2)

We can see more clearly with a fitted trend line that the data is linear and is slowly increasing over time.

Suppressing seasonality by aggregating by year:

ht_yearly_aggregate = aggregate(happytails.ts, nfrequency= 1, FUN= sum)

plot(ht_yearly_aggregate, bty= "l")

Plot each month separately:

# set the outer margin area to the right a bit bigger
par(oma = c(0, 0, 0, 6))

# We have 5 years of data
xrange <- c(1,6)
# Get the range of the ridership to set up a nicely formatted plot
yrange <- range(happytails.ts)

# set up the plot 
plot(xrange, yrange, type="n", xlab="Year", ylab="Attendance", bty="l")

# Give each of the months its own color, line type, and character
colors <- rainbow(12) 
linetype <- c(1:12) 
plotchar <- c(1:12)

# add lines 
for (i in 1:12) { 
  currentMonth <- subset(happytails.ts, cycle(happytails.ts)==i)
  lines(currentMonth, type="b", lwd=1.5,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) 
} 

# add a title
title("Attendance Broken Out by Month")

# add a legend 
legend(7,100, (1:12), cex=0.8, col=colors, pch=plotchar, lty=linetype)

Though the monthly plots can be a bit overwhelming at first sight, there are some good things we can pull off of it. For instance, it appears August 2012, March 2014, and July 2015 had the 3 lowest attendance counts from 2012-2017. January 2015, October 2016, and February 2017 had the highest attendance counts.

ggseasonplot()

Another approach is to use the ggseasonplot() function, which plots a line for each year, instead of each month as above. This can be useful for seeing which years did better than others. It’s evident from the below plot that 2017 had some of their highest attendance numbers, climbing above all other years multiple times, while 2012 and 2013 had some of the lowest attendance counts. This is to be expected as we saw from the initial graph of the data that the totals slowly increase as time goes on.

ggseasonplot(happytails.ts)

qqnorm(happytails.ts)
qqline(happytails.ts, col= "red")

Autocorrelation and White Noise:

A time series is considered white noise if the variables are uncorrelated, meaning they are independent of one another and the autocorrelation coefficients are close to zero. We first use the autocorrelation function ggAcf() to plot the correlation coefficients.

ggAcf(happytails.ts)

As we can see, there are significant autocorrelations at numerous Lags.

Now, to be sure, we’ll run a Ljung-Box test on the data to check for white noise. If it returns a small p-value, we’ll know it’s not white noise.

Box.test(happytails.ts, lag=24, fitdf=0, type= "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  happytails.ts
## X-squared = 135.29, df = 24, p-value < 2.2e-16

We can see that there is a very small p-value, meaning that the data aren’t white noise, and are in fact correlated.